GENERATION AND SELECTION OF NOVEL DNA-BINDING
PROTEINS AND POLYPEPTIDES 

BACKGROUND OF THE INVENTION 
Field of the Invention

This invention relates to development of novel DNA-binding
proteins and polypeptides by an iterative process of
mutation, expression, selection, and amplification. The
ability to create novel DNA-binding proteins will have
far-reaching applications, including, but not limited to,
use in: a) treating viral diseases, b) treating genetic
diseases, c) preparation of novel biochemical reagents, and
d) biotechnology to regulate gene expression in cell
cultures. Several workers have shown that repressors
derived from bacteria function when expressed in eukaryotic
cells (BREN84, FIGG88, BROW87, HUMC87, HUMC88), but none
have shown how to generate proteins that bind
sequence-specifically to a predetermined DNA sequence. For
reviews of transcriptional control in eukaryotic cells, see
STRU87, JONE87, and MANI87. The present application deals
only with sequence-specific DNA-binding proteins,
abbreviated DBP.

Proteins, particularly repressors, having affinity for
specific sites on DNA modulate transcription of genes.
The best known are a group of proteins primarily studied in
prokaryotes that contain the structural motif
alpha-helix-turn-alpha-helix (H-T-H) (SAUE82, PABO84). These
proteins bind as dimers or tetramers to DNA at specific
operator sequences that have approximately palindromic
sequences. Contacts made by two adjacent alpha helices of
each monomer in and around two sites in the major groove of
B-form DNA are a major feature in the DNA-protein
interface. This group of proteins includes phage repressor
and Cro proteins, bacterial metabolic repressors such as
GalR, LacI, LexA, and TrpR, bacterial activator protein CAP
and activator/repressor AraC, bacterial transposon and
plasmid TetR proteins (PABO84), the yeast mating type
regulators MATa1 and MATalpha2 (MILL85), and eukaryotic
homeo box proteins (EVAN88).

Interactions between dimeric repressors and approximately
palindromic operators have usually been discussed in the
literature with attention focused on one half of the
operator, with the tacit or explicit assumption that
identical interactions occur in each half of the complex.
Departures from palindromic symmetry allow proteins to
distinguish among multiple related operators (SADL83,
SIMO84). One must view the DNA-protein interface as a
whole. The emphasis in the literature on dyad symmetry is
a barrier to determining the requirements for general novel
recognition of DNA by proteins.

The equilibrium geometry and flexibility of DNA are
determined by the sequence; see inter alia HOGA87, GART88,
and ULAN87. The interactions of ionic, polar, and
hydrophobic groups on the DNA with solvent molecules and
ions make detailed predictions of DNA conformation and
binding properties very difficult; cf. OHLE85, ULAN87, and
OTWI88.

Matthews (MATT88), commenting on the current collection
of protein-DNA structures, concludes that: a) different
H-T-H DBPs use their recognition helices differently,
b) there is no simple code that relates particular
base pairs to particular amino acids at specific locations
in the DBP, and c) "full appreciation of the complexity
and individuality of each complex will be discouraging to
anyone hoping to find simple answers to the recognition
problem." Schleif (SCHL88) has characterized the study of
DNA-binding proteins as a field still in its infancy and
emphasizes the difficulties of designing proteins that bind
predetermined sequences.



WO 90/07862

PCT/US90/00024



Prokaryotic repressors exist that are unrelated to
H-T-H binding proteins. Some of these bind to approximate
palindromic sequences (e.g., Salmonella typhimurium phage
P22 Mnt protein (VERS87a) and E. coli TyrR repressor
protein (DEFE86)). Others bind to operator sequences that
are partially symmetric (S. typhimurium phage P22 Arc
protein, VERS87b; E. coli Fur protein, DELO87; plasmid R6K
pi protein, FILU85) or non-symmetric (phage Mu repressor,
KRAU86).

Genetics has enabled extensive analysis of prokaryotic
DNA-binding proteins and their specific nucleic acid
recognition sequences. It is not yet possible, however, to
design a protein to bind strongly and specifically to an
arbitrary DNA sequence. As taught by the present invention
it is, nonetheless, possible to postulate a family of
potential DBP mutants and identify one having the desired
specificity by other means.

Genetic studies of the DNA-binding proteins show that
mutations in protein sequence that result in decrease of
protein function fall into two overlapping classes: 1)
those that destabilize the global protein structure or
folding and 2) those that specifically alter the binding
properties. The first class illuminates the general
problem of protein folding and stability, while the second
defines the interactions involved in the formation and
stabilization of the protein-DNA complex. Mutations in the
operator yield additional information.

Positions 84 to 91 in helix 5 of lambda repressor have been
subjected to extensive amino acid substitutions (REID88).
Two or three positions were varied simultaneously through
all twenty amino acids and those combinations giving normal
function were selected. The authors did not discuss
optimization of the number or positions of residues to vary
to obtain any particular functionality, nor did they
attempt to obtain proteins having alternate dimerization or
recognition functions.

Pakula et al. (PAKU86) have randomly mutagenized lambda
Cro. They sought and found non-functional mutants but did
not seek or find proteins having novel DNA-binding
properties, nor did they suggest how to select such
proteins.

Sequence-independent DNA-protein interactions are
thought to occur via electrostatic interactions between the
backbone of the DNA and charged or polar groups of the
protein (2^DE87, LEWI83, and TAKE85). Sequence-specific
interactions involve H-bonding, nonpolar, or van der Waals
contacts between exposed side groups or groups of the
polypeptide main chain and base pair edges exposed in the
major and minor grooves of the DNA.

Mutations that alter residues involved in specific
binding interactions with DNA have been identified in
prokaryotic DBPs, including lambda, 434, and P22 repressor
and Cro proteins, P22 Arc and Mnt, and E. coli trp and lac
repressors and CAP. These mutations occur in residues that
are exposed to solvent in the free protein but buried in
the protein-DNA complex.

A few cases have been reported (BASS88, YOUD83,
VERSSSa, CARU87, WHAR85b, WHAR87, EBRI84, and SPIR88) in
which a change in one or a few residues in a DNA-binding
protein not only abolishes binding by the protein to the
wild-type operator but also confers strong binding to a
different operator. In all the cited publications,
alteration of binding specificity has been accomplished by
using symmetrically-located pairs of alterations in the
operator sites. Single, asymmetric changes or multiple
changes asymmetrically located in either the binding
protein or its operator were not considered. In "helix
swap" experiments (WHAR84, WHAR85b, WHAR85a, SPIR88,
BUSH88, PABO84), multiple mutations are introduced into the
DNA-binding recognition helix of H-T-H proteins with the
goal of changing the operator specificity of one known DBP
to that of a different known DBP.

An extension of the "helix swap" experiments uses a
mixture of 434 repressor and 434R[alpha3(P22R)] (HOLL88).
This mixture recognizes and binds in vitro with high
affinity to a 16 bp chimeric operator comprising a 434
half-site and a P22 half-site, indicating that active
heterodimers are formed. The authors did not extend the
results to intracellular repression, nor did they perform
mutagenesis of the repressors and selection of cells to
create novel recognition patterns.

Two approaches have been developed to create novel
proteins through reverse genetics. In one approach,
dubbed "protein surgery" (DILL87), a substitution is
introduced at a single protein residue. This approach has
been used to determine the effects on structure and
function of specific substitutions in trypsin (CRAI85,
RAOS87, BASH87).

The other approach has been to generate a variety of
mutants at many loci within the cloned gene, the
"gene-directed random mutagenesis" method. The specific
location and nature of the change or changes are determined
post hoc by DNA sequencing. If loss of a wild-type function
confers a cellular phenotype, one screens colonies for
mutations; see PAKU86. This approach is limited by the
number of colonies that can be examined. An additional
important limitation is that many desirable protein
alterations require multiple amino acid substitutions and
thus are not accessible through single base changes or even
through all possible amino acid substitutions at any one
residue.

The objective in both these approaches has been,
however, to analyze the effects of a variety of point
mutations, so that rules governing such substitutions could
be developed (ULME83). Progress has been hampered by the
efforts involved in using either method (ROBE86).

Oliphant et al. (OLIP86) and Oliphant and Struhl
(OLIP87) have demonstrated ligation and cloning of highly
degenerate oligonucleotides and have applied saturation
mutagenesis to the study of promoter sequence and function.
They have suggested that similar methods could be used to
study genetic expression of protein coding regions of
genes, but they do not say how one should: a) choose
protein residues to vary, or b) select or screen mutants
with desirable properties.

Ward et al. (WARD86) have engineered heterodimers
from homodimers of tyrosyl-tRNA synthetase. Methods of
converting homodimeric DBPs into heterodimeric DBPs are
disclosed in the present invention. Methods of deriving
single-polypeptide pseudo-dimeric DBPs from homodimeric
DBPs are disclosed in the examples of the present
invention.

Benson et al. (BENS86) have developed a scheme to
detect genes for sequence-specific DNA-binding proteins.
They do not consider non-symmetric target DNA sequences nor
do they suggest mutagenesis to generate novel DNA-binding
properties. Their method is presented as a method to
detect genes for naturally occurring DNA-binding proteins.
Because the selective system is lytic growth of phage, low
levels of repression cannot be detected. Selective
chemicals, as disclosed in the present application, on the
other hand, can be finely modulated so that low level
repression is detectable.

Elledge and Davis (ELLE89a) and Elledge et al.
(ELLE89b) have used an occluded aadA gene in a selection
for cells expressing eukaryotic DBPs. The supposed
recognition sequence of the sought DBP is incorporated into
the strong promoter that occludes aadA on a low-copy number
plasmid. Their system is presented as a tool for cloning
pre-existing DBPs and there is no mention of variegation of
the gene that encodes the potential DBP. Furthermore,
there is no discussion of the symmetry of the target
sequence or of the symmetry of the DBP.

Ladner and Bird, WO88/06601, suggest strategies for the
preparation of asymmetric repressors. In one embodiment, a
gene is constructed that encodes, as a single polypeptide
chain, the two DNA-binding domains of a naturally-occurring
dimeric repressor, joined by a polypeptide linker that
holds the two binding domains in the necessary spatial
relationship for binding to an operator. While they prefer
to design the linker based on protein structural data (cf.
Ladner, U.S. Patent 4,704,692), they state that
uncertainties in the design of the linker may be resolved by
generating a family of synthetic genes, differing in the
linker-encoding subsequence, and selecting in vivo for a
gene encoding the desired pseudo-dimer. Ladner and Bird do
not consider the background of false positives that would
arise if the two-domain polypeptides dimerize to form
pseudo-tetramers.

The binding of lambdoid repressors, Cro and CI
repressor, is taken, in WO88/06601, as canonical even though
other DBPs were known having operators of different
lengths. WO88/06601 maintains that the 17 bp lambdoid
operators can be divided into three regions: a) a left arm
of five bases, b) a central region of seven bases, and c) a
right arm of five bases. Several other DBPs are known for
which this division is inappropriate. Further, WO88/06601
states that the sequence and composition of the central
region, in which edges of bases are not contacted by the
DBP, are immaterial. There is direct evidence for 434
repressor (KOUD87, KOUD88) that the sequence and
composition of the central region strongly influence
binding of 434 repressor.

Once a pseudo-dimer is obtained, they then obtain an
asymmetric pseudo-dimer by the following technique.
First, the user of WO88/06601 is directed to construct a
family of hybrid operators in which the sequences of the
left and right arms are specified; no specification is
given for the central seven bases. In each member of the
family, the left arm contains the same sequence as the
wild-type operator left arm while the right arm 5-mer is
systematically varied through all 1024 possibilities.
Similarly, in the gene encoding the pseudodimer, the codons
for one recognition helix have the wild-type sequence while
the codons coding for the other recognition helix are
highly varied. The variegated pseudodimer genes are
expressed in bacterial cells, wherein the hybrid operators
are positioned to repress a single highly deleterious gene.
Thus, it is supposed that one can identify a recognition
helix for each possible 5-mer right arm of the operator by
in vivo selection; the correspondences between 5-mer right
arms and sequences of recognition helices are compiled
into a dictionary. The consequences of mutations or
deletions in the deleterious genes are not considered.
WO88/06601 suggests that successful constructions may be
very rare, e.g. one in 10^, but ignores other genetic
events of similar or greater frequency.
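
The figure of 1024 follows from four bases at each of five
positions (4^5). A minimal sketch in Python, offered only as
an illustration; the function name all_kmers is hypothetical
and does not come from WO88/06601 or the present application:

```python
from itertools import product

BASES = "ACGT"  # the four DNA bases


def all_kmers(k):
    """Return every DNA sequence of length k over the four bases."""
    return ["".join(p) for p in product(BASES, repeat=k)]


right_arms = all_kmers(5)
print(len(right_arms))  # 4**5 = 1024
```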

To obtain a repressor for an arbitrary 17-mer operator,
the user of WO88/06601:

a) finds the 5-mer sequence of the left arm in the
   dictionary and uses the corresponding recognition
   helix sequence in the first DNA-binding domain of the
   pseudodimer,

b) ignores the sequence and composition of the next seven
   bases, and

c) finds the 5-mer sequence of the right arm in the
   dictionary and uses the corresponding recognition
   helix sequence in the second DNA-binding domain of the
   pseudodimer.
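
The lookup procedure attributed to WO88/06601 in steps a)
to c) can be sketched as follows. This is an illustration of
the 5-7-5 division only, not an implementation from either
document: the function names and the toy dictionary entries
are hypothetical, and the sketch ignores strand orientation
of the right arm.

```python
def split_operator(operator):
    """Split a 17-base operator into left arm (5), centre (7), right arm (5)."""
    if len(operator) != 17:
        raise ValueError("expected a 17-base operator")
    return operator[:5], operator[5:12], operator[12:]


def lookup_helices(operator, dictionary):
    """Pick one recognition helix per arm; the central 7 bases are ignored."""
    left, _central, right = split_operator(operator)
    return dictionary[left], dictionary[right]


# Toy dictionary with placeholder helix labels (hypothetical values).
toy_dictionary = {"AAAAA": "helix-1", "GGGGG": "helix-2"}
print(lookup_helices("AAAAACCCCCCCGGGGG", toy_dictionary))
```

Note that the central region is discarded by construction,
which is the feature of the scheme that the surrounding
discussion questions.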

WO88/06601 also envisions means for producing a
heterodimeric repressor. A plasmid is provided that
carries genes encoding two different repressors. A
population of such plasmids is generated in which some
codons are varied in each gene. WO88/06601 instructs the
user to introduce very high levels of variegation without
regard to the number of independent transformants that can
be produced. WO88/06601 also instructs the user to
introduce variegation at widely separated sites in the
gene, though there is no teaching concerning ways to
simultaneously introduce high levels of variegation at
widely separated sites in the gene or concerning
maintenance of diversity without selective pressure, as
would be needed if the variegation were introduced stepwise.
WO88/06601 teaches that codons thought to be involved in
the protein-protein interface should be preferentially
mutated to generate heterodimers. Cells transformed with
this population of plasmids will produce both the desired
heterodimer and the two "wild-type" homodimers. WO88/06601
advises that one select for production of the heterodimer
by providing a highly deleterious gene controlled by a
hybrid operator, and beneficial genes controlled by the
wild-type operators. The fastest growing cells, it is
taught, will be those that produce a great deal of the
heterodimer (which blocks expression of the deleterious
gene) and little of the homodimers (so that the beneficial
genes are more fully expressed). There is no consideration
of mutations or deletions in the deleterious gene or in the
wild-type operators; such mutations will produce a
background of fast-growing cells that do not contain the
desired heterodimers.

SUMMARY OF THE INVENTION

This invention relates to the development of novel
proteins or polypeptides that preferentially bind to a
specific subsequence of double-stranded DNA (the "target")
which need not be symmetric, using a novel scheme for in
vivo selection of mutant proteins exhibiting the desired
binding specificities.

The novel binding proteins or polypeptides may be
obtained by mutating a gene encoding on expression: 1) a
known DNA-binding protein within the subsequence encoding a
known DNA-binding domain, 2) a protein that, while not
possessing a known DNA-binding activity, possesses a
secondary or higher order structure that lends itself to
binding activity (clefts, grooves, helices, etc.), 3) a
known DNA-binding protein but not in the subsequence known
to cause the binding, or 4) a polypeptide having no known
3D structure of its own.

This application uses the term "variegated DNA" to
refer to a population of molecules that have the same base
sequence through most of their length, but that vary at a
number of defined loci. Using standard genetic engineering
techniques, variegated DNA can be introduced into a plasmid
so that it constitutes part of a gene (OLIP86, OLIP87,
CHEN88, AUSU87, REID88). When plasmids containing
variegated DNA are used to transform bacteria, each cell
makes a version of the original protein. Each colony of
bacteria produces a different version from most other
colonies. If the variegations of the DNA are concentrated at
loci that code on expression for residues known to be on the
surface of the protein or in loops, a population of genes
will be generated that code on expression for a population
of proteins, many members of which will fold into roughly
the same 3D structure as the parental protein. Most often we
generate mutations that are concentrated within codons for
residues thought to make contact with the DNA. Secondarily,
we introduce mutations into codons specifying residues
that are not directly involved in DNA contact but that
affect the position or dynamics of residues that do contact
the DNA.
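
To give a sense of scale for such a population, the sketch
below counts how many distinct sequences arise when a given
number of codons are varied. It assumes, purely for
illustration, that each varied codon may be any of the 64
triplets; the text above does not specify a particular
variegation scheme, and the function name library_sizes is
hypothetical.

```python
def library_sizes(n_codons):
    """Count variants when n_codons positions are fully randomized.

    Assumption (not from the source): every varied codon may be any of
    the 64 possible triplets. The protein count covers full-length
    variants only, i.e. 20 amino acids per position, ignoring stops.
    """
    dna_variants = 64 ** n_codons
    protein_variants = 20 ** n_codons
    return dna_variants, protein_variants


for n in (2, 3, 5):
    print(n, library_sizes(n))
```

Because the counts grow exponentially with the number of
varied codons, the number of loci varied at once has to be
weighed against the number of independent transformants that
can be obtained.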

In general, a variegated population of DNA molecules,
each of which encodes one of a large (e.g. 10^) number of
distinct potential target-binding proteins, is used to
transform a cell culture. The cells of this cell culture
are engineered with binding marker genetic elements so
that, under selective conditions, the cell thrives only if
the expressed potential target-binding protein in fact
binds to the target subsequence, preventing transcription of
these binding marker genetic elements. (Typically, binding
of a successful target-binding protein to the target
subsequence blocks expression of a gene product that is
deleterious under selective conditions. Alternatively,
binding of a successful target-binding protein can
inactivate a strong promoter that otherwise occludes
transcription of a beneficial gene.) The mutant cells are
directed to express the potential target-binding proteins
and the selective conditions are applied. Cells expressing
proteins binding successfully to the target are thus
identified by in vivo selection. If the binding
characteristics are not fully satisfactory, the amino acid
sequences of the best binding proteins are determined
(usually by sequencing the corresponding genes), a new
population of DNA molecules is synthesized that encode
variegated forms of the best binding proteins of the last
cull, mutant cells are prepared, the new population of
potential DNA-binding proteins is expressed, and the best
proteins are once again identified by the superior growth of
the corresponding transformants under selective conditions.
The process is repeated until a protein or polypeptide with
the desired binding characteristics is obtained. Its
corresponding gene may then be moved to a suitable
expression system.

In the simplest form of this invention, the mutant
cells are provided with a selectable genetic element, the
transcription of which is deleterious to the survival or
growth of the cell. The selectable genetic element either
is a promoter or is operably linked to a promoter
regulating the expression of the gene. The promoter, or
other non-coding region of the genetic element (for example,
an intron), has been modified to include the desired target
subsequence in a position where it will not interfere with
transcription of the selectable gene unless a protein binds
to that target subsequence. Each mutant cell is also
provided with a gene encoding on expression a potential
DNA-binding protein, operably linked to a promoter that is
preferably regulated by a chemical inducer. When this gene
is expressed, the potential DNA-binding protein has the
opportunity to bind to the target and thereby protect the
cell from the selective conditions under which the product
of the binding marker gene would otherwise harm the cell.

In addition to the desired outcome of these in vivo
selections, there exist a number of possible genetic events
that allow the cells to escape the selection, producing
artifacts and inefficiency by allowing the growth of
colonies that do not express the desired sequence-specific
DNA-binding proteins. Examples of mechanisms, other than
the desired outcome, that lead to cell survival under the
selective conditions include: a) a point mutation or a
deletion in the selectable gene eliminates expression or
function of the selectable gene product; b) a host
chromosomal mutation compensates for or suppresses function
of the selectable gene product; c) the introduced potential
DNA-binding protein binds to a DNA subsequence other than
the chosen target subsequence and blocks expression of the
selectable gene; d) the introduced potential DNA-binding
protein binds to and inactivates the gene product of the
selective gene; and e) a DNA-binding protein endogenous to
the host mutates so that it binds to the selectable gene
and blocks expression of the selectable gene.

This invention relates, in particular, to the design
of a vector that confers upon the host cells the desired
conditional sensitivity to the selection conditions in such
a manner as to greatly reduce the likelihood of false
positives and artifactual colonies.

First, at least two selectable genes that are
functionally unrelated are used to reduce the risk that a
single point mutation in the vector (or in the host
chromosome) will destroy the sensitivity of the cell to
the selective conditions, since it will eliminate only one
of the two (or more) deleterious phenotypes. Similarly, a
single introduced gene for a potential DNA-binding protein
that binds to and inactivates the gene product of one
selectable gene will not bind and inactivate the gene
product of the other selectable gene. The likelihood that
point mutations will occur in both selectable genes, or that
two host chromosomal mutations will spontaneously arise
that suppress the effects of two genes, is the product of
the individual probabilities of the necessary events, and
thus is extremely low.
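
The product rule invoked here can be made concrete with a
short calculation. The sketch assumes the escape events are
independent, and the per-gene rate of 1e-6 is an arbitrary
illustrative number, not a figure from this application; the
function name escape_probability is hypothetical.

```python
import math


def escape_probability(per_gene_rates):
    """Probability that every selectable gene is independently inactivated.

    Assumes the inactivating events are independent, so the joint
    probability is the product of the individual probabilities.
    """
    return math.prod(per_gene_rates)


one_gene = escape_probability([1e-6])         # single selectable gene
two_genes = escape_probability([1e-6, 1e-6])  # two unrelated genes
print(one_gene, two_genes)
```

Under these assumptions, adding a second unrelated
selectable gene lowers the chance of a mutational escape by
the same factor as the single-gene rate itself.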

The DNA sequences of the two or more selectable genes
preferably should not have long segments of identity: a) to
avoid isolation of a DBP that binds these identical regions
instead of the intended target sequence, and b) to reduce
the likelihood of genetic recombination. The degeneracy of
the genetic code allows us to avoid exact identity of more
than a few, e.g. 10, bases.
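
One way to check a pair of gene sequences against such a
limit is to compute the longest run of bases they share. The
sketch below is an illustration under stated assumptions: it
compares the given strands only (reverse complements are not
examined), and the function name longest_shared_run is
hypothetical.

```python
def longest_shared_run(seq_a, seq_b):
    """Length of the longest exact substring common to both sequences."""
    best = 0
    prev = [0] * (len(seq_b) + 1)
    for a in seq_a:
        curr = [0] * (len(seq_b) + 1)
        for j, b in enumerate(seq_b, start=1):
            if a == b:
                # Extend the match that ended at the previous positions.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


# Both sequences encode Ala-Ala-Ala, using different synonymous codons
# (GCT, GCC, GCA and GCG all encode alanine).
print(longest_shared_run("GCTGCTGCT", "GCCGCAGCG"))  # 2
```

The example shows how synonymous codon choice can keep two
segments encoding the same amino acids from sharing more
than a couple of consecutive bases.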

Second, the selectable genes are placed on the vector
in alternation with genetic elements that are essential to
plasmid maintenance. Thus, a single deletion event, even
of thousands of bases, cannot eliminate both selectable
genes without also eliminating vital genetic elements.
Alternatively, the selectable genes are placed in the
bacterial chromosome. Spontaneous deletions from the
chromosome are rare.

Third, different promoters are associated with each
of the selectable genes. This ensures that the selection
does not isolate cells harboring genes encoding on
expression novel DNA-binding proteins that bind specifically
to subsequences that are part of the promoter but not the
chosen target subsequence. Each cell expresses only one or
a few introduced potential DNA-binding proteins (multiple
potential DNA-binding proteins could arise if one cell is
transformed by two or more variegated plasmids). The
probability that two such proteins will occur in one cell
and that one will bind to the promoter of the first
selectable gene and that the second will bind to the
different promoter of the second selectable gene is very
small.

Fourth, the selectable binding marker genes may be
placed on a vector different from the vector that carries
the potential dbp gene. DNA manipulations that introduce
variegation into the potential dbp gene can cause mutations
in the vector remote from the site of the intended
mutations. Thus, we may place the selectable binding marker
genes in the bacterial chromosome or on a separate plasmid
that is compatible with the dbp vector.

Finally, the same promoter is used to initiate
transcription of two genes: a) one of the deleterious
selectable binding marker genes, and b) a beneficial or
essential gene also borne on the plasmid and used to
select for uptake and maintenance of the plasmid (e.g., an
antibiotic resistance gene, such as bla). In the case of
the beneficial or essential gene, however, there is no
instance of the predetermined target DNA subsequence
associated with the promoter. Thus, if a DNA-binding
protein binds to a subsequence of the promoter other than
the predetermined target DNA subsequence, it will frustrate
expression of the beneficial or essential gene. If desired,
more than one such beneficial or essential gene may be
provided. In that event, copies of promoter A may be
operably linked to both deleterious gene A' (with an
instance of the target) and beneficial gene A" (without an
instance of the target), while copies of promoter B are
operably linked to both deleterious gene B' (with target)
and beneficial gene B" (without target).

The selection system described above is a powerful 
tool that eliminates most of the artifacts associated with 
selections based on cloning vectors that use a single 
selectable gene or that have all selectable genes in a 
contiguous region of the plasmid. While this invention 
embraces using the aforementioned elements of a selection 
system singly or in partial combination, most preferably 
all are employed. 

In one embodiment, the invention relates to a cell
culture comprising a plurality of cells, each cell bearing:

i) a gene coding on expression for a potential
   DNA-binding protein or polypeptide, where such protein or
   polypeptide is not the same for all such cells, but
   rather varies at a limited number of amino acid
   positions; and

ii) at least two independent operons, each comprising at
   least one binding marker gene coding on expression
   for a product conditionally deleterious to the
   survival or reproduction of such cells, the promoter
   of each said binding marker gene containing a
   predetermined target DNA subsequence so positioned that,
   if said target DNA subsequence is bound by a DNA-binding
   protein or polypeptide, said conditionally deleterious
   product is not expressed in functional form.



Most known DNA-binding proteins bind to palindromic or
nearly palindromic operators. It is desirable to be able
to obtain a protein or polypeptide that binds to a target
DNA subsequence having no particular sequence symmetry. In
another embodiment of the present invention, such a binding
protein is obtained by creating a hybrid of two dimeric
DNA-binding proteins, one of which (DBP_L) recognizes a
symmetrized form of the left subsequence of the target
subsequence, and the other of which (DBP_R) recognizes a
symmetrized form of the right subsequence of the target
subsequence.

Cells producing equimolar mixtures of DBP_L and DBP_R
contain approximately 1 part (DBP_L)2, 2 parts DBP_L:DBP_R,
and 1 part (DBP_R)2. The DBP_L:DBP_R heterodimers, which
bind to the non-symmetric target subsequence, may be
isolated from a cell lysate by affinity chromatography using
the target sequence as the ligand. If desired, the
heterodimers may be stabilized by chemically crosslinking
the two binding domains.
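
The approximate 1:2:1 proportions correspond to random
pairing of monomers. A minimal sketch, assuming that the two
monomers associate at random with equal affinity for
themselves and for each other (an assumption made for this
illustration only); the function name dimer_fractions is
hypothetical:

```python
def dimer_fractions(frac_left):
    """Expected dimer fractions for random pairing of two monomer types.

    frac_left is the mole fraction of the 'left' monomer; pairing is
    assumed random, with no preference for homo- or heterodimers.
    """
    p = frac_left
    q = 1.0 - p
    return {"LL": p * p, "LR": 2 * p * q, "RR": q * q}


print(dimer_fractions(0.5))  # {'LL': 0.25, 'LR': 0.5, 'RR': 0.25}
```

Under this assumption only half of the dimers in an
equimolar mixture are heterodimers, which is why enrichment
or engineered pairing surfaces are of interest.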

It is also possible to modify both DBP_L and DBP_R, by a
process of variegation and selection, so that they have
(without disturbing their affinity for the predetermined
DNA target subsequence) complementary but not
dyad-symmetric protein-protein binding surfaces. When such
polypeptides are mixed, in vivo or in vitro, the primary
species will be DBP_L:DBP_R heterodimers. Alternatively,
reversing the steps, a dimeric binding protein may be
modified so that its two binding domains have complementary
but not dyad-symmetric protein-protein binding surfaces,
and then the DNA-contacting surfaces are modified to bind
to the right and left halves of the target DNA subsequence.
In either case, the resulting cooperative domains can be
crosslinked for increased stability.

When a binding protein is engineered so that its two
binding domains have complementary, but not dyad-symmetric
protein-protein binding surfaces, then in the preferred
embodiment one of the steps will be a "reverse selection",
i.e., a selection for a protein that does not bind to the
symmetrized half-target sequence. To facilitate such
reverse selection, it is desirable that the binding marker
genes be capable of "two-way" selection (VINO87). For a
two-way selectable gene there exist both a first selection
condition in which the gene product is deleterious
(preferably lethal) to the cell and a second selection
condition in which the gene product is beneficial
(preferably essential) to the cell. The first selection
condition is used for forward selection, in which we select
for cells expressing proteins that bind to the target so
that gene expression is repressed. The second selection
condition is used for reverse selection, in which we select
for cells that do not express a protein that binds to the
target, thereby allowing expression of the gene product.

Abolition of function is much easier than engineering
of novel function. Reverse selection can isolate cells
that: a) express no DBP, b) express unstable proteins
descendant from a parental DBP, or c) express a protein
descendant from a parental DBP having very nearly the same
3D structure as the parental DBP, but lacking the
functionality of the parent. We are interested in this third
class. It is difficult, however, to distinguish among
these classes genetically. Therefore, when using reverse
selection, we carefully choose sites to mutate the protein
(so as to minimize the chances of destroying tertiary
structure) and we introduce a lower level of variegation
than in forward selection. We must verify biochemically
that a stable, folded protein is produced by the isolated
cells.

Another concept of the present invention is the use of
a polypeptide, rather than a protein, to preferentially
bind DNA. This polypeptide, instead of binding the DNA
molecule as a preformed molecule having shape complementary
to DNA, will wind about the DNA molecule in the major or
minor groove. Such a polypeptide has the advantages that:
a) it is smaller than a protein having equivalent
recognizing ability and may be easier to introduce into
cells, and b) it may serve as a model for creation of other
compounds that bind DNA sequence-specifically.

In a preferred embodiment, transcription of the DNA
that codes on expression for potential DNA-binding proteins
or polypeptides is regulated by addition of a chemical
inducer to the cell culture, such as
isopropylthiogalactoside (IPTG). Other regulatable promoters
having different inducers or other means of regulation are
also appropriate.

The invention encompasses the design and synthesis of
variegated DNA encoding on expression a collection of
closely related potential DNA-binding proteins or
polypeptides characterized by constant and variable regions,
said proteins or polypeptides being designed with a view
toward obtaining a protein or polypeptide that binds a
predetermined target DNA subsequence.

For the purposes of this invention, the term "potential
DNA-binding polypeptide" refers to a polypeptide
encoded by one species of DNA molecule in a population of
variegated DNA wherein the region of variation appears in
one or more subsequences encoding one or more segments of
the polypeptide having the potential of serving as a
DNA-binding domain for the target DNA sequence or having the
potential to alter the position or dynamics of protein
residues that contact the DNA. A "potential DNA-binding
protein" (potential-DBP) may comprise one or more potential
DNA-binding polypeptides. Potential-DBPs comprising two or
more polypeptide chains may be homologous aggregates (e.g.,
A2) or heterologous aggregates (e.g., AB).

From time to time, it may be helpful to speak of the
"parental sequence" of the variegated DNA. When the novel
DNA-binding domain sought is a homolog of a known
DNA-binding domain, the parental sequence is the sequence that
encodes the known DNA-binding domain. The variegated DNA
is identical with this parental sequence at most loci, but
will diverge from it at chosen loci. When a potential
DNA-binding domain is designed from first principles, the
parental sequence is a sequence that encodes the amino acid
sequence that has been predicted to form the desired
DNA-binding domain, and the variegated DNA is a population of
"daughter DNAs" that are related to that parent by a high
degree of sequence similarity.

The fundamental principle of the invention is one of
forced evolution. The efficiency of the forced evolution
is greatly enhanced by careful choice of which residues are
to be varied. The 3D structure of the potential
DNA-binding domain and the 3D structure of the target DNA
sequence are key determinants in this choice. First, a set
of residues is identified that can either simultaneously
contact the target DNA sequence or affect the orientation
or flexibility of residues that can touch the target. Then
all or some of the codons encoding these residues are
varied simultaneously to produce a variegated population of
DNA. The variegated population of DNA is introduced into
cells so that a variegated population of cells producing
various potential-DBPs is obtained.

The highly variegated population of cells containing
genes encoding potential-DBPs is selected for cells
containing genes that express proteins that bind to the
target DNA sequence ("successful DNA-binding proteins").
After one or more rounds of such selection, one or more of
the chosen genes are examined and sequenced. If desired,
new loci of variation are chosen. The selected daughter
genes of one generation then become the parental sequences
for the next generation of variegated DNA (vgDNA).

DNA-binding proteins (DBPs) that bind specifically to
viral DNA so that transcription is blocked will be useful
in treating viral diseases, either by introducing DBPs into
cells or by introducing the gene coding on expression for
the DBP into cells and causing the gene to be expressed.
In order to develop such DBPs, we need use only the
nucleotide sequence of the viral genes to be repressed.
Once a DBP is developed, it is tested against virus in
vivo. Use of several independently-acting DBPs that all
bind to one gene allows us to: a) repress the gene despite
possible variation in the sequence, and b) focus
repression on the target gene while distributing side
effects over the entire genome of the host cell. Animals,
plants, fungi, and microbes can be genetically made
intracellularly immune to viruses by introducing, into the
germ line, genes that code on expression for DBPs that bind
DNA sequences found in viruses that infect the animal
(including human), plant, fungus, or microbe to be
protected.

Sequence-specific DBPs may also be used to treat
autoimmune and genetic disease either by repressing
noxious genes or by causing expression of beneficial
genes.

Some naturally-occurring DBPs bind sequence-specifically
to DNA only in the presence or absence of specific
effector molecules. For example, Lac repressor does not
bind the lac operator in the presence of lactose or
isopropylthiogalactoside (IPTG); Trp repressor binds DNA
only in the presence of tryptophan or certain analogues of
tryptophan. The method of the present invention can be
used to select mutants of such DBPs that a) recognize a
different cognate DNA sequence, or b) recognize a different
effector molecule. These alterations would be useful
because: a) changing the cognate DNA sequence of known
inducible or de-repressible DBPs allows us to use the
novel DBP without affecting existing metabolic pathways,
and b) having novel effectors allows us to induce
or de-repress the regulated gene without altering the state
of genes that are controlled by the natural effectors. In
addition, temperature-sensitive DBPs could be made which
would allow us to control gene expression in the same way
that λ cI857 and P_L are used.

Conferring novel DNA-recognition properties on
proteins will allow development of novel restriction
enzymes that recognize more base pairs and therefore cut
DNA less frequently. For example, the methods of the
present invention will be useful in developing a derivative
of EcoRI (recognition site GAATTC) that recognizes and
cleaves a longer recognition site, such as TGAATTCA.
Proteins that recognize specific DNA sequences may also be
used to block the action of known restriction enzymes at
some subset of the recognition sites of the known enzyme,
thereby conferring greater specificity on that enzyme.
Other DNA-binding enzymes may also be obtained by the
methods described herein.
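
The claim that a longer recognition site is cut less frequently can be made concrete with a short calculation. The following Python sketch is illustrative only and rests on a simplifying assumption that is not part of the specification: the DNA is treated as a random sequence in which the four bases are equally frequent. The function name is hypothetical.

```python
def expected_spacing_bp(site: str) -> int:
    """Average distance in base pairs between occurrences of an exact
    recognition site, assuming random DNA with equal base frequencies."""
    return 4 ** len(site)


ecori_site = expected_spacing_bp("GAATTC")      # 6 bp site: 4**6
longer_site = expected_spacing_bp("TGAATTCA")   # 8 bp site: 4**8

print(ecori_site)                 # 4096
print(longer_site)                # 65536
print(longer_site // ecori_site)  # 16
```

Under this assumption, extending the EcoRI site by one specified base on each side would reduce the cutting frequency sixteen-fold.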

The methods of the present invention are primarily
designed to select from a highly variegated population
those cells that contain genes that code on expression for
proteins that bind sequence-specifically to predetermined
DNA sequences. The genetic constructions employed can also
be used as an assay for putative DBPs that are obtained in
other ways.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1   Schematic of protein bound to DNA.
Figure 2   Schematic of evolution of a binding protein.
Figure 3   Plasmid pKK175-6.
Figure 4   Plasmid pAA3H.
Figure 5   Summary of construction of pEP1009.
Figure 6   Plasmid pEP1001.
Figure 7   Plasmid pEP1002.
Figure 8   Plasmid pEP1003.
Figure 9   Plasmid pEP1004.
Figure 10  Plasmid pEP1005.
Figure 11  Plasmid pEP1007.
Figure 12  Plasmid pEP1009.

DETAILED DESCRIPTION OF THE INVENTION AND ITS PREFERRED
EMBODIMENTS

Abbreviations:

The following abbreviations will be used throughout
the present invention:

Abbreviation    Meaning
DBP             DNA-binding protein
idbp            A gene encoding the initial DBP
pdbp            A gene encoding a potential-DBP
vgDNA           variegated DNA
dsDNA           double-stranded DNA
ssDNA           single-stranded DNA
Tc^R, Tc^S      Tetracycline resistance or sensitivity
Gal^R, Gal^S    Galactose resistance or sensitivity
Gal+, Gal-      Ability or inability to utilize galactose
Fus^R, Fus^S    Fusaric acid resistance or sensitivity
Km^R, Km^S      Kanamycin resistance or sensitivity
Ap^R, Ap^S      Ampicillin resistance or sensitivity

Terminology

A domain of a protein that is required for the
protein to specifically bind a chosen DNA target
subsequence is referred to herein as a "DNA-binding domain". A
protein may comprise one or more domains, each composed of
one or more polypeptide chains. A protein that binds a DNA
sequence specifically is denoted as a "DNA-binding
protein". In one embodiment of the present invention, a
preliminary operation is performed to obtain a stable
protein, denoted as an "initial DBP", that binds one
specific DNA sequence. The present invention is concerned
with the expression of numerous, diverse, variant
"potential-DBPs", all related to a "parental potential-DBP" such
as a known DNA-binding protein, and with selection and
amplification of the genes encoding the most successful
mutant potential-DBPs. An initial DBP is chosen as
parental potential-DBP for the first round of variegation.
Selection isolates one or more "successful DBPs". A
successful DBP from one round of variegation and selection
is chosen to be the parental DBP to the next round. The
invention is not, however, limited to proteins with a
single DNA-binding domain since the method may be applied
to any or all of the DNA-binding domains of the protein,
sequentially or simultaneously.

Amino acids are indicated by the single-letter code, 
AUSU87, Appendix A. 

Symbols that represent ambiguous DNA are: T, C, A, G
for themselves; M for A or C; R for A or G; W for A or T;
S for C or G; Y for T or C; K for G or T; V for A, C, or G;
H for A, C, or T; D for A, G, or T; B for C, G, or T; N for
any base.
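
The ambiguity symbols listed above can be captured in a small lookup table. The Python sketch below is an illustration only; the names `AMBIGUITY` and `expand` are hypothetical and are not taken from the specification. It enumerates every unambiguous sequence represented by a pattern written with these symbols.

```python
from itertools import product

# Each symbol maps to the bases it may stand for, as listed in the text.
AMBIGUITY = {
    "T": "T", "C": "C", "A": "A", "G": "G",
    "M": "AC", "R": "AG", "W": "AT",
    "S": "CG", "Y": "TC", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "N": "ACGT",
}


def expand(pattern):
    """Return every unambiguous DNA sequence matched by the pattern."""
    choices = [AMBIGUITY[symbol] for symbol in pattern.upper()]
    return ["".join(bases) for bases in product(*choices)]


print(expand("CCWGG"))     # ['CCAGG', 'CCTGG']
print(len(expand("NNK")))  # 32
```

A pattern with one W therefore denotes two sequences, and a codon written NNK (used here purely as an example of a variegated codon) denotes 4 x 4 x 2 = 32 codons.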




Conventionally, DNA sequences are written from 5' to
3', left-to-right.

     anti-sense DNA:   5' ATG CTT TTC ... 3'
     sense DNA:        3' TAC GAA AAG ... 5'
     mRNA:             5' AUG CUU UUC ... 3'
     protein:              M - L - F -

We will use the convention that the "sense" strand is the
strand used as template for mRNA synthesis.
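
The strand convention adopted here (the "sense" strand is the template) can be checked mechanically. The following Python sketch is an illustration under stated assumptions: the function names are hypothetical, and the codon table is deliberately limited to the three codons of the short example, so it is not a general translator.

```python
# Base-pairing rules for DNA, used to copy the template strand.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Partial codon table: only the codons needed for this example.
CODONS = {"AUG": "M", "CUU": "L", "UUC": "F"}


def mrna_from_sense(sense_3_to_5):
    """Given the template ("sense") strand written 3'->5', return the
    mRNA written 5'->3' (complementary bases, with U in place of T)."""
    return sense_3_to_5.translate(COMPLEMENT).replace("T", "U")


def translate(mrna):
    """Translate an mRNA whose codons all appear in CODONS."""
    return "".join(CODONS[mrna[i:i + 3]] for i in range(0, len(mrna) - 2, 3))


mrna = mrna_from_sense("TACGAAAAG")
print(mrna)             # AUGCUUUUC
print(translate(mrna))  # MLF
```

Note that the mRNA has the same sequence as the anti-sense strand, apart from U replacing T.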


In the present invention, the words "grow", "growth",
"culture", and "amplification" mean increase in number, not
increase in size of individual cells. In the present
invention, the words "select" and "selection" are used in
the genetic sense; i.e., a biological process whereby a
phenotypic characteristic is used to enrich a population
for those organisms displaying the desired phenotype.

One selection is called a "selection step"; one pass
of variegation followed by as many selection steps as are
needed to isolate a successful DBP is called a "variegation
step". The amino acid sequence of one successful DBP
from one round becomes the parental potential-DBP to the
next variegation step. We perform variegation steps
iteratively until the desired affinity and specificity of
DNA-binding between a successful DBP and chosen target DNA
sequence are achieved.
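
The iteration of variegation steps can be pictured with a toy simulation. The Python sketch below is a schematic only and makes several assumptions that are not part of the specification: `toy_binding_score` is a hypothetical stand-in for genetic selection (it counts matches to an arbitrary `ideal` sequence, which no real experiment would know in advance), variation is applied directly to amino acids rather than to codons, and all names and parameters are invented for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 single-letter codes


def variegate(parent, loci, library_size, rng):
    """Return daughters identical to the parent except at chosen loci."""
    daughters = []
    for _ in range(library_size):
        residues = list(parent)
        for i in loci:
            residues[i] = rng.choice(AMINO_ACIDS)
        daughters.append("".join(residues))
    return daughters


def toy_binding_score(protein, ideal):
    """Hypothetical selection criterion: positions matching `ideal`."""
    return sum(a == b for a, b in zip(protein, ideal))


def forced_evolution(parent, ideal, loci_per_step, library_size=200, seed=0):
    """Run one variegation step per entry of loci_per_step; the best
    member of each library becomes the parent of the next step."""
    rng = random.Random(seed)
    for loci in loci_per_step:
        library = variegate(parent, loci, library_size, rng)
        library.append(parent)  # the parent competes, so the score never drops
        parent = max(library, key=lambda p: toy_binding_score(p, ideal))
    return parent


start = "AAAAAAAA"
ideal = "AQRAASTA"
best = forced_evolution(start, ideal, loci_per_step=[[1, 2], [5, 6]])
print(start, "->", best, toy_binding_score(best, ideal))
```

Only the chosen loci change from step to step, which mirrors the statement that the variegated DNA is identical with the parental sequence at most loci.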

In a "forward selection" step, we select for the
binding of the PDBP to a target DNA sequence; in a "reverse
selection" step, for failure to bind. The target DNA
sequence may be the final target sequence of interest, or
the immediate target may be a related sequence of DNA
(e.g., a "left symmetrized target" or "right symmetrized
target"). There is an important distinction between
screening and selection. Screening merely reveals which
cells express or contain the desired gene. Selection
allows desired cells to grow under conditions in which
there is little or no growth of undesired cells (and
preferably eliminates undesired cells).

The term "operon" is used to mean a collection of one
or more genes that are transcribed together. We will use
operon to refer also to one or more genes that are
transcribed together in eukaryotic cells independent of
post-transcriptional processing.

The term "binding marker gene" is used to mean those
genes engineered to detect sequence-specific DNA binding,
as by association of a target DNA with a structural gene
and expression control sequences. A single operon may
include more than one binding marker gene (e.g., galT,K).
A "control marker gene" is one whose expression is not
affected by the specific binding of a protein to the target
DNA sequence. The "control promoter" is the promoter
operably linked to the control marker gene.

Palindrome, palindromic, and palindromically are used
to refer to DNA sequences that are the same when read along
either strand, e.g.,

                Palindromic DNA

                Rotational axis
                      |
                      v
     5' C T A G C C T   A G G C T A G 3'
                      .
     3' G A T C G G A   T C C G A T C 5'

The arrow indicates the center of the palindrome; if the
sequence is rotated 180° about the central dot, it appears
unchanged. In the present application, "palindromic" does
not apply to sequences that have mirror symmetry within one
strand, such as

                 Mirror Plane
                      |
     5' C T A G C C T | T C C G A T C 3'
     3' G A T C G G A | A G G C T A G 5'
                      |

DNA sequences can be partially palindromic about some
point (that can be either between two base pairs or at one
base pair) in which case some bases appear unchanged by a
180° rotation while other bases are changed.

A special case of partially palindromic sequence is a
"gapped palindrome" in which palindromically related bases
are separated by one or more bases that lack such symmetry:

                     Gapped Palindrome
        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
    5'  C  T  A  G  C  T  T  T  C  C  G  G  C  T  A  G  3'
    3'  G  A  T  C  G  A  A  A  G  G  C  C  G  A  T  C  5'

has CTAGC (bases 1-5) palindromically related to GCTAG
(bases 12-16) while the sequence TTTCCG (bases 6-11) in
the center has no symmetry.
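
The three kinds of symmetry distinguished above can be tested directly on the top strand. The following Python sketch is illustrative; the function names are hypothetical. A sequence is palindromic in the sense used here when it equals its own reverse complement, whereas mirror symmetry means the single strand reads the same backwards.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindromic(seq):
    """True if the duplex is unchanged by a 180-degree rotation."""
    return seq == reverse_complement(seq)


def is_mirror_symmetric(seq):
    """True if one strand reads the same in both directions."""
    return seq == seq[::-1]


def palindromic_positions(seq):
    """1-based positions whose base is unchanged by the rotation."""
    rotated = reverse_complement(seq)
    return [i + 1 for i, (a, b) in enumerate(zip(seq, rotated)) if a == b]


print(is_palindromic("CTAGCCTAGGCTAG"))           # True
print(is_palindromic("CTAGCCTTCCGATC"))           # False
print(is_mirror_symmetric("CTAGCCTTCCGATC"))      # True
print(palindromic_positions("CTAGCTTTCCGGCTAG"))  # [1, 2, 3, 4, 5, 12, 13, 14, 15, 16]
```

For the gapped palindrome, the positions reported match bases 1-5 and 12-16, with the central bases 6-11 unrelated by symmetry.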

For the purposes of this invention, a "non-deleterious
cloning site" is a region on a plasmid or phage that can be
cut with one restriction enzyme or with a combination of
restriction enzymes so that a large linear molecule
comprising all essential elements can be recovered.

Overview: Standard Methods

Bacterial strains are cultured by standard methods
(DAVI80, MILL72, AUSU87). Constructions of vectors are by
standard methods (MANI82, ZOLL84, AUSU87). All genetic
constructions are confirmed, first by analysis with
restriction enzymes, and then by sequencing. Sequencing is
by the Sanger dideoxy method or by the Maxam-Gilbert
chemical method. Constructions that confer a phenotype are
tested for display of the desired phenotype. These necessary
controls are not described repeatedly.

Overview: The Selection System

The present invention separates mutated genes that
specify novel proteins with desirable sequence-specific
DNA-binding properties from closely related genes that
specify proteins with no or undesirable DNA-binding
properties, by: 1) arranging that the product of each
mutated gene be expressed in the cytoplasm of a cell
carrying a chosen DNA target subsequence, and 2) using
genetic selections incorporating this chosen DNA target
subsequence to enrich the population of cells for those
cells containing genes specifying proteins with improved
binding to the chosen target DNA sequence.

A selectably deleterious gene is positioned relative
to, usually downstream from, the target sequence so that
the gene is not expressed if a successful DNA-binding
protein specific to this target is expressed in the cell
and binds the target sequence. Alternatively, a selectably
beneficial gene may be arranged so that its transcription
is occluded by a strong promoter (ADHY82, ELLE89a,
ELLE89b). The target sequence is placed in or near the
occluding promoter so that successful binding by a protein
will repress the occluding promoter and allow transcription
of the beneficial gene. Elledge and coworkers disclose that
such systems work best in the bacterial chromosome or on
low-copy-number plasmids. The cell will survive exposure
to the selective conditions if transcription of the
selectably deleterious genetic element is blocked.

The preferred cell line or strain is easily cultured,
has a short doubling time, has a large collection of well
characterized selectable genes, includes variants that are
deficient in genetic recombination, and has a well developed
transformation system that can easily produce at least
10^ independent transformants/ug of DNA. Bacterial cells
are preferred over yeasts, fungi, plant, or animal cells
because they are superior on every count. Among bacteria,
E. coli is the premier candidate because of the wealth of
knowledge of genetics and cellular processes. Other
bacterial strains, such as S. typhimurium, Pseudomonas
aeruginosa, Klebsiella aerogenes, Bacillus subtilis, or
Streptomyces coelicolor, could be used. DBPs that bind to
host regulatory sequences, such as promoters, will be
toxic. Thus, development of a DBP that specifically binds
to E. coli promoters is preferably done in a cell line or
strain, such as S. coelicolor, having significantly
different promoter sequences.

In the most preferred embodiments, all novel DBPs are
developed in E. coli recA- strains. The recA- genotype is
preferred over other rec- mutations because the recA-
mutation reduces the frequency of recombination more than
other known rec- mutations and the recA- mutation has fewer
undesirable side effects. We choose a host strain that
methylates or does not methylate the target sequence in the
desired way. For example, a Dcm- strain is appropriate if
the target sequence contains CCWGG and we want a DBP that
binds the unmethylated form.

As vectors, phage, such as M13, have the advantage of
a high infectivity rate. Organisms or phage having a phase
in their life cycle in which the genome is single-stranded
DNA have a higher mutation rate than organisms or phage
that have no phase in which the genome is single-stranded
DNA. Plasmids are, however, preferred because genes on
plasmids are much more easily constructed and altered than
are genes in the bacterial chromosome and are more stable
than genes borne on phage, such as M13. M13-derived
vectors are nearly as preferred as plasmids.

In some embodiments, the cloning vector will carry: a)
the selectable genes for successful DBP isolation, b) the
pdbp gene, c) a plasmid origin of replication, and d) an
antibiotic resistance gene not present in the recipient
cell to allow selection for uptake of plasmid. Preferably
the operative vector is of minimum size.

Alternatively, the selectable binding marker genetic
elements are placed on a vector different from but
compatible with the vector that carries the pdbp gene. This
arrangement has the advantages that engineering the pdbp
gene is easier on a smaller plasmid and manipulation of
pdbp cannot introduce mutations into the selectable
binding marker genes.

Standard selections for plasmid uptake and maintenance
in E. coli include use of antibiotics (e.g., ampicillin
(Ap)) as shown in Table 2. Selection of cells with
antibiotics is preferred to nutritional selections, e.g.,
TrpA+, for several reasons. Nutritional selection may be
overcome by large volumes of cells or growth medium; host
chromosomal auxotrophy is rarely total; crossfeeding of the
non-growing cells by prototrophic recipients obscures the
outlines of the colonies; and late mutations to prototrophy
may arise on the plate due to spontaneous mutation of
nongrowing cells. Nonetheless, nutritional selection may
be employed.

Similarly, plasmids for use in B. subtilis are
engineered for selection of uptake and maintenance using
antibiotics. Plasmids used in streptomycete species bear
genes for resistance to antibiotics such as thiostrepton,
neomycin, and methylenomycin, in preference to auxotrophic
markers or sporulation and pigment screens such as spo in
bacilli and mel in streptomycetes.

Recombinant DNA manipulations in yeasts have been
achieved using complementation of auxotrophic markers, some
of which are shown in Table 3. High backgrounds are
surmounted by use of two unrelated binding marker genes
carried on the same vector, e.g., Leu2+ and Ura3+.
Selection for G418 resistance conferred by the bacterial
aphII gene expressed in yeast offers the advantages of
reduced background and a wider range of appropriate
recipient strains. The current upper range of efficiency
of DNA uptake into yeast cells indicates that this organism
is not now preferred for the process described in this
patent, although results could be achieved by large scale
practice.

The selection systems must be so structured that other
mechanisms for loss of gene expression are much less likely
than the desired result, repression at the target DNA
subsequence. Other mechanisms that could yield the desired
phenotype include: point mutations that inactivate the
deleterious gene or genes, deletion of the deleterious gene
or genes, host mutations that suppress the deleterious
genes, and repression at a site other than the target DNA
sequence.

A wide range of selectable phenotypes for E. coli and
S. typhimurium has been described (VINO87). Two broad
classes of selections are useful in this invention,
nutritional and chemical. Such selections are inherently
conditional in that they employ addition of a
growth-inhibitory chemical to the selective medium, or
manipulation of the nutrient components of the selective
medium. Further conditionality of the preferred method is
imposed by transcriptional regulation (e.g., by IPTG in
combination with the lacUV5 promoter and the LacIq
repressor) of the variegated pdbp gene. In those members of
the population that express DBPs that bind to the target,
IPTG indirectly controls the selectable genes; in these
cells, increased IPTG leads to reduced expression of the
selectable genes. Therefore the phenotypes for selection
are distinguished only in the presence of an inducing
chemical, and potential
storage and routine handling of the strains.

Selection of mutant strains capable of producing
proteins that can bind to the target DNA subsequence is
enabled by engineering conditional lethal genes or
growth-inhibiting genes located downstream from the promoter
that contains the target DNA subsequence. In the preferred
embodiment, at least two independent conditional lethal or
inhibitory selections are performed simultaneously. It is
possible to use a single selection to achieve the same
purpose, but this is not preferred. Two selections are
strongly preferred since a simple mutation in the selected
gene, occurring at a frequency of 10^-6 to 10^-8/cell, would
occur in two selected genes simultaneously at the product
of the individual frequencies, 10^-12 to 10^-16. Thus use of
two selections substantially reduces the probability of
isolation of artifactual revertant or suppressor strains.
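
The benefit of a second, independent selection follows from multiplying the individual escape frequencies. The Python sketch below is illustrative only: the population size and the per-gene mutation frequency used in the example are assumed values chosen for the arithmetic, and the function name is hypothetical.

```python
def expected_escapers(cells, *frequencies):
    """Expected number of cells that escape every selection through
    independent mutations, given a per-cell frequency for each gene."""
    combined = 1.0
    for f in frequencies:
        combined *= f  # independent events: probabilities multiply
    return cells * combined


population = 1e9   # assumed number of cells under selection
per_gene = 1e-6    # assumed mutation frequency per selected gene

one_selection = expected_escapers(population, per_gene)
two_selections = expected_escapers(population, per_gene, per_gene)
print(one_selection, two_selections)
```

With these assumed values a single selection would be expected to pass on the order of a thousand artifactual survivors, whereas two simultaneous selections would be expected to pass far fewer than one.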

Selectable genes for which both forward and reverse
selections exist are preferred because, by changing host or
media, we can use these genes to select for binding by a
DBP to a target DNA sequence such that expression of one of
these genes is repressed, or we can select phenotypes
characteristic of cells in which there is no binding of the
DBP. For example, expression of the tet gene is essential
in the presence of tetracycline. On the other hand,
expression of the tet gene is lethal in the presence of
fusaric acid. Expression of the galT and galK genes in a
GalE- host in the presence of galactose is lethal (NIKA61).
Expression of galT and galK in a host that is GalE+ and
either GalT- or GalK- renders the cells Gal+ and allows
them to grow on galactose as sole carbon source.
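
The logic of a two-way selectable gene can be written out as a small truth table. The Python sketch below is a simplified model, using the tet example from the text and assuming an idealized all-or-none response: tet is taken to be fully repressed whenever the DBP binds the target and fully expressed otherwise. The function names are hypothetical.

```python
def survives(tet_expressed, agent):
    """Idealized fate of a cell exposed to one selective agent."""
    if agent == "tetracycline":
        return tet_expressed        # tet expression is essential
    if agent == "fusaric acid":
        return not tet_expressed    # tet expression is lethal
    raise ValueError("unknown selective agent: " + agent)


def selection_outcome(dbp_binds_target, agent):
    """The target is placed so that DBP binding represses tet."""
    tet_expressed = not dbp_binds_target
    return survives(tet_expressed, agent)


for binds in (True, False):
    for agent in ("fusaric acid", "tetracycline"):
        print(binds, agent, selection_outcome(binds, agent))
```

In this model, fusaric acid gives forward selection (only cells whose DBP binds the target survive) and tetracycline gives reverse selection (only cells lacking a binding DBP survive).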

The term "source of a selective agent" includes the
selective agent itself and any media components which cause
the cell to manufacture the selective agent.




The Detailed Examples describe selection of strains
with successful DBP binding to novel target subsequences
due to turn-off of two genes, each of which, if expressed,
confers sensitivity to a toxic substance. It is also
possible to use selection of strains in which successful
DBP binding to novel target operators turns off repressors
of genes encoding required gene products. For example,
using the binding marker gene P22 arc, we place an Arc
operator site so that binding of Arc represses expression
of a beneficial or conditionally essential gene, such as
amp. Another alternative is selection of expression of
required gene products due to successful binding of DBPs
derived from positive effectors, e.g.,
CAP from E. coli, the repressor from phage λ, or the Cro67
(BUSH88) mutant of λ Cro. Another alternative is to place
the target sequence in or near a strong promoter that
occludes transcription of a conditionally essential gene
(ELLE89a,b).

The selections described in the Detailed Examples
employ commercially available cloned genes on plasmids in
strains that can be obtained from the ATCC (Rockville, MD).
Alternatively, the genes can be produced synthetically from
published sequences or isolated from a suitable genomic or
cDNA library.

Numerous types of selections are possible for selection
of DBP expression in E. coli. The toxic and inhibitory
agents listed in Table 4 are used with appropriately
engineered host strains and vectors to select for loss of
the gene functions listed above. Repression of
transcription of these genes allows growth in the presence
of the agents. Other outcomes such as deletions or point
mutations in these genes may also be selected with these
agents, hence two functionally unrelated selections are
used in combination. These agents share the property that
cell metabolism is stopped, and unlike the nutritional
selections, the inhibitory agents are not overcome by
components of the growth medium or turnover of
macromolecules in the cells. Selections using antibiotics,
metabolite analogs, or inhibitors are preferred. Another
class of selections includes those for repression of phage
or colicin receptors, or for repression of phage promoters.
These agents kill by single-hit kinetics and, in the case
of phage, are self-replicating, making the multiplicity of
agent to putative repressed cell much more difficult to
control, and so are not preferred (BENS86).

Any selection system relevant to the cell line or
strain may be substituted for those in the examples given
here, with appropriate changes in the engineering of the
cloning vectors. One example is the dominant pheS+ gene
carried on plasmid pHE3 (ATCC #37,161) in a pheS12
background. Turn-off of pheS+ is selected with
p-fluorophenylalanine (Sigma Corp., St. Louis, MO).

We could choose the Streptomyces coelicolor cloned
glucose kinase gene for selection of the DBP+ phenotype,
using the metabolite analog deoxyglucose.

Each batch of antibiotic is checked for MIC (minimum
inhibitory concentration) under the condition of use.
Increased concentration of antibiotic may be used to
increase the stringency of the selection in most cases.

The user varies the medium formulation (pH, cation
concentrations, buffering agent, etc.) for a particular
selection if the results are not optimal with the strain at
hand. For example, Maloy and Nunn (MALO81) describe a
medium yielding improved selection of Fus^R E. coli colonies
from a Tc^R background, compared to the medium employed by
Bochner (BOCH80) for this purpose using S. typhimurium.

Stringency of selection can be modulated by controlling
copy number of plasmids bearing the selectable genes;
increasing copy number of selectable genes increases the
stringency of the selection.

During the initial phases of the progressive development
of DBP molecules, it is desirable to produce a high
intracellular concentration of DBP. The stringency of the
selection is increased in subsequent phases of successful
DBP development by allowing fewer molecules of DBP per
cell. Thus it is preferred to regulate transcription of
pdbp by an inducible or derepressible promoter, such as
PlacUV5.

High total cell input often decreases stringency of
selections, by providing metabolites that are specifically
omitted, by mass action with respect to an inhibitory
agent, or by generating a large number of artificial
satellite colonies that follow the appearance of
genetically resistant colonies. The number of cells that are
successfully transformed is a function of efficiency of
ligation and transformation processes, both of which are
optimized in the embodiment of this invention. Procedures
for maximal transformation and ligation efficiency are from
Hanahan (HANA85) and Legerski and Robberson (LEGE85),
respectively. Increasing stringency is imposed under the
conditions of high efficiency of these processes by
inoculation of plates with small volumes or dilutions of
cell samples. Pilot experiments are performed to determine
optimum dilution and volume.

In Detailed Example 1, the transformation event is
followed by dilution and growth of cells in permissive
medium. Exogenous inducer of DBP expression is included at
this step, and a set of selections is then imposed in
liquid medium. Surviving cells are concentrated by
centrifugation, and selected for these and additional
traits using solid medium in Petri plates. This protocol
offers the advantage that fewer identical siblings are
obtained and a larger population is easily screened. In
Detailed Example 1, repression of the Gal^S phenotype is
selected by exposing transformants to galactose in liquid
medium, which produces visible lysis of galactose-sensitive
cells. The second selection employed in Detailed Example 1
is for the Fus^R phenotype due to repression of Tc^R, which
requires limitation of total inoculum size to 10^
cells/plate. Similar protocol variations are introduced to
combine selections for transformation and successful DBP
function.

Tests of selective agents to determine the conditions 
that kill or inhibit sensitive cells are performed with 
pure cultures of sensitive cells. These include strains 
carrying the selective marker genes having the recognition 
sequence of the IDBP as target, with and without idbp, and
with and without the inducer of idbp expression. 

Cultures of sensitive cells are applied to selective
media as inocula appropriate to the selection (usually 10^
to 10® per plate). Sufficient numbers of replicates (10^7
to 10^ total sensitive cells for each medium) are tested by
each selection. The rate at which the cultures produce
revertants and phenotypic suppressors (considered together
as revertants) is determined. A rate greater than 10"^ per
cell indicates that stringency must be increased. If
reversion rates are below this level, as we have shown for
the selections described in Example 1, mixing experiments
are performed to determine the sensitivity of recovery of a
small fraction of resistant cells from a vast excess of
sensitive cells.

Normally, the deleterious gene product of a binding
marker gene is a protein. It may also be an RNA, e.g., an
mRNA which is antisense to the mRNA of an essential gene
and therefore blocks translation of the latter mRNA into
protein. Another alternative is that transcription of the
binding marker gene may be deleterious because this
transcription occludes transcription of an adjacent
beneficial gene. Selectively deleterious genes suitable
for use in the present invention include those shown in
Table 4.

The two selectably deleterious genes are preferably
not functionally related. For example, the chosen genes
should not code for proteins localized to or affecting the
same macromolecular assembly in the cell or which alter the
same or intersecting anabolic or catabolic pathways. Thus,
use of two inhibitors that select for mutations affecting
RNA synthesis, aromatic amino acid synthesis, or each of
histidine and purine synthesis is not preferred. Similarly,
two inhibitors that are transported into the cell by
shared membrane components are thus functionally related,
and are not preferred. In this manner the user reduces the
frequency of isolation of single host mutations that yield
the apparent desired phenotype because of suppression of
the shared functionality, interacting component, or
precursor relationship. Host mutations of this type are
conveniently distinguished by a screen of the selectable
phenotypes in the absence of the inducer of the DBP, e.g.,
IPTG.

Examples of pairs of deleterious genes which are
recommended for use in the present invention are given in
Table 5A. In each case, one of the paired genes codes for
a product that acts intracellularly while the other codes
for a product that acts either in transport into or out of
the cell or acts in an unrelated biological pathway. Table
5B gives some pairs that are not recommended. These pairs
have not been shown to malfunction, but they are not
recommended, given the large number of choices that are
clearly functionally unrelated.

A preferred novel feature is the use of a copy of the
promoter of one of these beneficial or conditionally
essential genes, operably linked to the target DNA
subsequence, to direct transcription of the selectably
deleterious or conditionally lethal binding marker genes of
the plasmid. If the potential-DBP should repress the
selectable gene by binding to this promoter, it would also
repress this beneficial activity.

In order to assure that selection for DBP binding is
specific to the target and not the promoter, we preferably
place one of the two selectable binding marker genes
under the same transcription initiation signal as the gene
we use for selection of vector maintenance. In Detailed
Example 1, transcription of the galT and galK genes is
initiated by the Pamp promoter, as is the amp gene.

It is possible that the potential-DBP will bind
specifically to the boundary between the target DNA
sequence and the promoter, or within the structural gene.
In the preferred embodiment, we discriminate against this
mechanism by choosing a different promoter, operably linked
to another copy of the same target DNA sequence, for the
second selectable gene. Preferably, the two promoters that
initiate transcription of the selectable genes should be
strong enough to give a sensitive selection, but not too
strong to be repressed by binding of a novel DBP. Some
well studied promoters and their scores by the Mulligan
algorithm (MULL84) are shown in Table 6. Promoters that
score between 50% and 70% are good candidates for use in
binding marker genes. Preferably, the two promoters have
significant sequence differences, particularly in the
region of the junction to the target DNA sequence.
Specifically, the region between the -10 region and the
target sequence, which comprises five to seven bases,
should have no more than two identical bases in the two
promoters. Although the -10 regions of promoters show high
homology, promoters are known (e.g., Pamp having GACAAT and
Pneo having TAAGGT) that have as few as two out of six
bases identical in this region, and such difference is
preferred.

The target DNA sequence for the potential DNA-binding
protein must be associated with the two deleterious binding
marker genes and their promoters so that expression of the
binding marker genes is blocked if a novel protein in fact
binds to the target sequence. The target DNA sequence
could appear upstream of the gene, downstream of the gene,
or, in certain hosts, in a noncoding region (viz., an
"intron") within the gene. Preferably, it is placed
upstream of the coding region of the gene, that is, in or
near the RNA polymerase binding site for the gene, i.e., the
promoter. If the binding marker gene is an occluding
promoter, the target is, preferably, placed downstream of
the promoter. Placement of the target DNA sequence
relative to the promoter is influenced by two main
considerations: a) protein binding should have a strong
effect on transcription so that the selection is sensitive,
and b) the activity of the promoter in the absence of a
binding protein should be relatively unaffected by the
presence of the test DNA sequence compared to any other
target subsequence.
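
The promoter-pair criterion stated earlier in this passage (no
more than two identical bases in the junction region, and few
identical bases in the -10 region) can be checked mechanically.
The following is a minimal sketch only: the function names are
hypothetical, a position-by-position comparison of equal-length
sequences is assumed, and the only sequences taken from the text
are the -10 hexamers of Pamp and Pneo.

```python
def identical_positions(a, b):
    """Count positions at which two equal-length sequences carry the same base."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a.upper(), b.upper()))


def sufficiently_different(region1, region2, max_identical=2):
    """Apply the 'no more than two identical bases' rule to two aligned regions."""
    return identical_positions(region1, region2) <= max_identical


# -10 hexamers quoted in the text for Pamp and Pneo.
PAMP_MINUS10 = "GACAAT"
PNEO_MINUS10 = "TAAGGT"

if __name__ == "__main__":
    print(identical_positions(PAMP_MINUS10, PNEO_MINUS10))
    print(sufficiently_different(PAMP_MINUS10, PNEO_MINUS10))
```

The same function can be applied to the five-to-seven-base
junction regions of two candidate promoters once they have been
trimmed to a common length.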

In the present invention, we will deal primarily with
DNA target subsequences of 10 to 25 bases. It has been
noted that the highly conserved -35 region and the highly
conserved -10 region are separated by between 15 and 21
base pairs with a mode of 17 base pairs (HAWL83, MULL84).
Some of the bases between -35 and -10 are statistically
non-random; thus placement of target DNA sequences longer
than 10 bases between the -10 and -35 regions would likely
affect the promoter activity independent of binding by
potential-DBPs. Because quantitative relationships between
promoter sequence and promoter strength are not well
understood, it is preferable, at present, to use known
promoters and to position the target at the edge of the RNA
polymerase binding site.

Protein binding to DNA has maximum effect on
transcription if the binding site is in or just downstream
from the promoter of a gene. Hoopes and McClure (HOOP87)
have reviewed the regulation of transcription initiation
and report that the LexA binding site can produce effective
repression in a variety of locations in the promoter
region. In a preferred embodiment, we place target DNA
sequences that begin with A or G so that the first 5' base
of the target sequence is the +1 base of the mRNA, as the
LexA binding site is located in the uvrD gene (HOOP87,
p1235). If the target sequence begins with C or T, we
preferably place the target so that the first 5' base of
the target is the +2 base of the mRNA and we place an A or
G at the +1 position. An alternative is to place the
target DNA sequences upstream of the -35 region, as the LexA
binding site is located in the ssb gene (HOOP87, p1235).
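
The +1/+2 placement rule above is simple enough to state as a
short function. This is an illustrative sketch, not part of the
described procedure: the function name and the default choice of
A as the inserted initiating base are assumptions (the text
allows either A or G).

```python
def place_target(target, initiating_base="A"):
    """Return (sequence starting at mRNA position +1, mRNA position of the
    first target base) following the purine-at-+1 rule described above."""
    target = target.upper()
    if initiating_base not in ("A", "G"):
        raise ValueError("the +1 base must be A or G")
    if target[0] in ("A", "G"):
        # Target begins with a purine: its first base serves as +1.
        return target, 1
    # Target begins with C or T: insert a purine at +1, target starts at +2.
    return initiating_base + target, 2


if __name__ == "__main__":
    print(place_target("ACTTTCCGCTGGAAAGT"))
    print(place_target("CTGCTGG", initiating_base="G"))
```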

It may be useful in early stages of the development of
a DBP to have more than one copy of the target DNA sequence
positioned so that binding of a DBP reduces transcription
of the selectable gene. Multiple copies of the target DNA
sequence enhance the sensitivity of phenotypic
characteristics to binding of DBPs to the target DNA sequence.
Multiple copies of the target DNA sequence are, preferably,
placed in tandem downstream of the promoter.
Alternatively, one could place one copy upstream of the
promoter and one or more copies downstream.

We arrange the genes on the plasmid or plasmids in
such a way that no single deletion event eliminates both
deleterious genes without also eliminating a gene essential
either to plasmid replication or cell survival. Thus,
resistant colonies are unlikely to arise through deletions
because two independent deletion events are required.
Similarly, simultaneous occurrence of one point mutation
and one deletion is as unlikely as two point mutations or
two deletions.



A typical arrangement of genes on the operative
cloning vector, similar to that used in Detailed Example 1,
is:

    /--- PlacUV5-O-pdbp-t --- Px-T-galT,K-t ---\
    |                                          |
    \--- Px-amp-t --- Py-T-tet-t --- ori ------/

Px represents the promoter that initiates transcription of
the amp gene. A second copy of Px initiates transcription
of galT,K. Py is a promoter driving tet. t is a
transcriptional terminator (different terminators may be
used for different genes), and T is the target subsequence.
PlacUV5 is the lacUV5 promoter, O represents the lacO
operator, and pdbp is a variegated gene encoding potential
DBPs. Placement of pdbp relative to other genes is not
important because mutations or deletions in pdbp cannot
cause false positive colony isolates. Indeed, it is not
necessary that the pdbp gene be on the selection vector at
all. The purpose of the selection vector is to ensure that
the host cell survives only if one of the PDBPs binds
to the target sequence (forward selection) or fails to so
bind (reverse selection). The pdbp gene may be introduced
into the host cell by another vector.



Two-way selections are available for both tet and
galT,K (vide supra). The orientation of each gene in the
selection vector is unimportant because strong terminators
(e.g., rrnBt1, rrnBt2, phage fd terminator) are preferably
placed at the ends of each transcription unit. That galT,K
and tet are separated by essential genes, however, is of
fundamental importance. The sequence ori is essential for
plasmid replication, and the amp gene, the transcription of
which is initiated by Px, is essential in the presence of
Ap. Successful repression of galT,K and tet is selected
with galactose and fusaric acid. No single deletion event
can remove both the latter genes and allow plasmid
maintenance or cell survival under selection. In addition,
binding by a novel DBP to the Px promoter would render the
cell Ap sensitive. These arrangements make appearance of a
novel DBP that binds the target DNA more probable than any
of the other modes by which the cells can escape the
designed selections.
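
The single-deletion argument can be verified by enumerating every
contiguous deletion on a circular gene map. The sketch below is
illustrative only. The gene order is an assumed reading of the
vector arrangement described above (pdbp, galT,K, ori, tet, amp,
closing back to pdbp), and the function name is hypothetical.

```python
def single_deletion_escapes(order, deleterious, essential):
    """List contiguous deletions on a circular map that remove every
    deleterious gene while sparing every essential element."""
    n = len(order)
    escapes = []
    for start in range(n):
        for length in range(1, n + 1):
            arc = [order[(start + k) % n] for k in range(length)]
            deleted = set(arc)
            if deleterious <= deleted and not (essential & deleted):
                escapes.append(arc)
    return escapes


# Assumed circular order of elements on the selection vector.
VECTOR = ["pdbp", "galT,K", "ori", "tet", "amp"]
DELETERIOUS = {"galT,K", "tet"}
ESSENTIAL = {"ori", "amp"}

if __name__ == "__main__":
    # An empty list means no single deletion escapes both selections.
    print(single_deletion_escapes(VECTOR, DELETERIOUS, ESSENTIAL))
```

Placing galT,K and tet next to each other, by contrast, yields a
non-empty list, which is why their separation by essential
elements matters.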

Overview: Choice of target DNA binding sequence for
development of successful novel DBPs:

Our goal is the development, in part by conscious
design and in part by in vivo selection, of a protein
which binds to a DNA sequence of significance, e.g., a
structural gene or a regulatory element, and through such
binding inhibits or enhances its biological activity. In
the preferred embodiment, the protein represses
transcription of a deleterious element, such as a viral
gene. A sufficiently long sequence could be the target of
several independently acting DBPs.

Another goal of this invention is to derive one or
more DBPs that bind sequence-specifically to any
predetermined target DNA subsequence. It is not yet possible
to design the DBP-domain amino-acid sequence from a set of
rules appropriate to the target DNA subsequence. Rather,
it is possible to pick sets of residues that can affect the
DNA recognition of a parental DBP. Then, variegation of
residues that affect DNA recognition coupled with selection
for binding to the target DNA subsequence can produce a
novel DBP specific for the target DNA subsequence. Such a
method is limited by the number of amino acids that can be
varied at one time. To develop a novel DBP that recognizes
15 bases could require changing 15 or more residues in the
initial DBP. Variegation of 15 residues through all 20
amino acids would produce 20^15 ≈ 3.3 x 10^19 sequences and
is beyond current technology. Thus we start with the
recognition sequence of the initial DBP, change two to five
bases and select, in one or more rounds of variegation and
selection, a novel DBP that recognizes this new target DNA
subsequence. This new DBP becomes the parent to the next
step in which the target DNA subsequence is changed by an
additional two to five bases so that a stepwise series of
changes in binding protein and changes in target is used.
It is emphasized here that, although we initially select
DBPs that recognize sequences similar to that recognized by
the IDBP, the ultimate target sequence recognized by the
desired final DBP can be completely unrelated to the
recognition sequence of the IDBP.
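
The arithmetic behind the stepwise approach can be reproduced in
a few lines. This is a sketch for orientation only; the function
names are hypothetical, and the round count simply divides the
number of base changes by the number changed per step, ignoring
any repeated rounds within a step.

```python
def library_size(n_residues, n_amino_acids=20):
    """Number of protein sequences when n_residues are each varied
    through n_amino_acids possibilities."""
    return n_amino_acids ** n_residues


def steps_needed(bases_to_change, bases_per_step):
    """Minimum number of target-changing steps (ceiling division)."""
    return -(-bases_to_change // bases_per_step)


if __name__ == "__main__":
    print(f"{library_size(15):.2e}")   # all 15 residues at once
    print(f"{library_size(5):.2e}")    # five residues at a time
    print(steps_needed(15, 5), steps_needed(15, 2))
```

Varying five residues at a time gives 3.2 x 10^6 sequences per
step, compared with about 3.3 x 10^19 for fifteen at once.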

The process of finding a DBP that recognizes a
sequence within a genome is shortened if we pick sequences
that have some similarity to the cognate sequence of the
initial DBP. The intent is to locate several unique sites
in the gene which can be bound specifically by DBPs such
that transcription through those sites is reduced.

The sequences of some regions of genes of eukaryotic
pathogens vary among strains (SAAG88). To optimize the
search for target sites in the gene selected for repression
such that repression will be effective in all or the
majority of strains of a pathogen, regions of conserved DNA
sequence within the gene are, preferably, identified.

There may be a very small number of sequences that
occur in the genome of the host cells for which binding of
a DBP will be lethal. For this reason, the regulatory
sequences, such as promoters, of the host organism are not
preferred targets for DBP development. Preferably, the
target sequence occurs only in the gene of interest. For
some applications, target sequences that occur at locations
other than the site of intended action may be used if
binding of a protein to the extra sites is acceptable.

Preliminary elimination of non-unique sequences is
done by searching DNA sequence data banks of host genomic
sequences and bacterial strain sequences, and by searching
the plasmid sequences for matches to the potential target
subsequences. Remaining potential target subsequences are
then used as oligonucleotide probes in Southern analyses of
host genomic DNA and bacterial DNA. Sequences which do not
anneal to host or bacterial DNA under stringent conditions
are retained as target subsequences. These target
subsequences are cloned into the operative vector at the
promoters of the selection genes for DBP function, as
described for the test DNA binding sequence.

Choice of target subsequences is based also on the
optimal location of target sites within a gene such that
transcription will be maximally affected. Studies of
monkey L-cells show that lac repressor can bind to lac
operator, or to two lac operators in tandem, in the L-cell
nucleus (HUMC87, HUMC88). Further, this binding results in
repression of a downstream chloramphenicol acetyl
transferase gene in this system, and repression is relieved
by IPTG. Two tandem operators repress CAT enzyme production
to a greater extent than a single operator. The user
preferably locates two to four target sites relatively
close to each other within the transcriptional unit.

Overview: Strategies for Obtaining Protein Recognition of
Non-Symmetric Target DNA Sequences

In vitro, lac repressor binds to a perfectly palindromic
synthetic lac operator, which omits the central base
pair of the natural operator, 10 times more tightly than it
does to the wild-type operator (SADL83). In vivo, the
synthetic operator represses beta-galactosidase activity to
a 4-fold lower level than does the wild-type operator.
Simons et al. (SIMO84) describe the isolation of five lac
operator-like subsequences from eukaryotic DNA that titrate
lac repressor in vivo. All five subsequences share a 14 bp
consensus subsequence that lacks the central base pair of
the natural lac operator and is a perfect palindrome of the
left seven base pairs of the natural lac operator. A
synthetic 11-base pair inverted repeat of the left half of
the E. coli lac operator binds lac repressor 8-fold more
tightly than does the natural operator. We conclude that
natural repressors have not evolved to have maximal
affinity for their operators; rather, they have evolved to
produce optimal regulation.

Symmetrized operator subsequences for E. coli trp
repressor (BASS87) and λ repressor (BENS88) bind their
respective repressors more tightly than do the natural
operators. For λ repressor, unlike lac repressor, the
optimal binding subsequence both includes a base pair at
the center of symmetry and contains a non-consensus base
pair (BENS88).

It is important to note that the focus of all of the
above experiments has been on symmetry: symmetric
operators, symmetric changes in protein binding residues,
etc. In the natural systems discussed above, increasing
operator subsequence symmetry towards the consensus
palindrome does indeed increase the strengths of the binding
interactions. This result arises, however, not from
symmetry per se, but from optimizations of the protein-DNA
interactions at both operator half-sites. If the
DNA-binding protein presents a different binding domain to
the operator at each half-site, symmetric DNA operator
subsequences are not only not optimal but are unfavorable.
The implications of this distinction have not been
considered in the literature.

Starting from natural, dyad symmetric or de novo
designed DBPs we can generate specific DBPs with
non-symmetric target recognition using a variety of
strategies. Seven examples of strategies are listed;
however, this invention is not limited to these particular
strategies.

1) Produce two dimeric DBPs. One DBP is produced by the
   means described here to recognize a symmetrized
   version of the left half of the target and is called
   DBP_L. The other DBP is similarly produced to
   recognize a symmetrized version of the right half of the
   target and is called DBP_R. Cells producing equimolar
   mixtures of DBP_L and DBP_R contain approximately 1 part
   DBP_L dimer, 2 parts DBP_L:DBP_R heterodimer, and 1 part
   DBP_R dimer. Thus one half of the DBP molecules bind
   to the non-symmetric target subsequence. These
   heterodimers may be isolated by affinity separation
   techniques, or the 50% active mixture may be used
   directly.

2) Produce a mixture of DBP_L and DBP_R as described in (1)
   and crosslink proteins with an agent such as
   glutaraldehyde. Use a column that contains the DNA target
   subsequence to purify DBP_L:DBP_R heterodimer from the
   homodimers.

3) Produce (by variegation of the dimerization interface
   of a known DBP, as described more fully hereafter) a
   heterodimer comprised of complementing mutant
   sequences DBP1 and DBP2 such that the heterodimer
   DBP1:DBP2 is exclusively formed. Next, alter the
   recognition domains of DBP1 and DBP2 by the methods
   described here to produce heterodimers having asymmetric
   recognition, e.g., DBP1_L:DBP2_R.

4) Produce a heterodimer DBP1:DBP2 as in (3) and
   crosslink the proteins in vitro with an agent such as
   glutaraldehyde as in (2).

5) Produce two dimeric DBPs with left and right target
   recognition elements as in (1); produce complementing
   heterodimer mutations as in (3) such that the
   non-symmetric recognition heterodimer DBP1_L:DBP2_R is
   constructed.

6) Produce a pseudo-dimer composed of a single
   polypeptide chain such that recognition elements that
   contact different bases are encoded by different
   codons; each DNA-contacting residue and every domain
   is independently variable and so asymmetric
   recognition can be established.



7) Produce DBP_L and DBP_R in separate steps, where the
   heterodimer DBP:DBP_R is developed to recognize a
   hybrid target consisting of the wild-type left
   half-site fused to the right half of the target, and
   DBP_L:DBP is developed to recognize a hybrid target
   consisting of the wild-type right half-site fused to
   the left half of the target. Once produced, DBP_L and
   DBP_R are co-expressed intracellularly as described in
   (1) above, crosslinked as described in (2) above, or
   are modified to produce the obligately complementing
   non-symmetric recognition heterodimer DBP1_L:DBP2_R as
   described in (5) above.

Detailed Example 1 employs strategy 5; Detailed
Example 2 employs strategy 6. Section 6 of Detailed
Example 1 also describes strategy 3.
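
The 1:2:1 distribution quoted in strategy (1) follows from random
pairing of two monomer species. The sketch below makes that
assumption explicit (random association, with no preference for
homo- or heterodimer formation); the function name is
hypothetical, and it also shows how the heterodimer fraction
falls when expression is not equimolar.

```python
def dimer_fractions(frac_left):
    """Fractions of L:L, L:R and R:R dimers for randomly associating
    monomers, given the fraction of monomers that are DBP_L."""
    if not 0.0 <= frac_left <= 1.0:
        raise ValueError("frac_left must lie between 0 and 1")
    p, q = frac_left, 1.0 - frac_left
    return {"L:L": p * p, "L:R": 2 * p * q, "R:R": q * q}


if __name__ == "__main__":
    print(dimer_fractions(0.5))   # equimolar expression
    print(dimer_fractions(0.8))   # four-fold excess of DBP_L
```

Strategies (3) and (5) aim to replace this statistical mixture
with obligate heterodimers.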

For each target DNA sequence chosen, a left arm T_L, a
center core T_C and a right arm T_R are defined. Two
symmetrized derivatives of this target subsequence, the
left symmetrized target T_L->-T_C-T_L<- and the right
symmetrized target T_R<--T_C-T_R->, are designed and
synthesized.

We divide the target DNA sequence into T_L, T_C, and T_R
based on knowledge of the interaction of the parental DBP
with DNA sequences to which it binds, i.e., the operator.
This knowledge may come from X-ray structures of parental
DBP-operator complexes, models based on 3D structures of
the DBP, genetics, or chemical modification of parental
DBP-operator complexes.

Our strategy is to pick a target by finding a sequence
that contains a close approximation to the central core of
the operator. Bases in the center of the target may not be
contacted directly by the DBP but affect the specificity of
binding by influencing the position or flexibility of the
bases that are contacted directly by the DBP.
Accommodating changes (operator vs. target) in uncontacted
bases may require subtle changes in the tertiary or
quaternary structure of the DBP, such as might be effected
by alterations in the dimerization interface of a dimeric
DBP. We can accommodate most changes in bases directly
contacted by the DBP by altering the residues that contact
those bases. Therefore, it is easier to accommodate changes
in those bases that are directly contacted by the DBP and we
endeavor to avoid changes in the central core by seeking a
target the central core of which is highly similar to the
central core of the operator of the parental DBP.

We must balance two tendencies: a) if we assign too
many bases to T_C, we are unlikely to find a close
approximation of T_C in the genome of interest; and b) if we
assign too few bases to T_C, we may thereby assign
uncontacted bases to the arms. Differences between the
target and the DNA sequence that binds the initial DBP at
uncontacted bases in the arms may be difficult to
accommodate through variegation of residues that contact the
DNA directly; such a situation could cause variegation and
selection to yield a functional DBP very slowly.
Preferably, the length of T_C is at least 6 but not greater
than 10.

We search the target genome, first with the entire
operator binding sequence, and then with progressively
shorter central fragments of the operator, until an
acceptable match is found. A match is acceptable if all or
almost all the bases (e.g., six out of seven) match and
other criteria are met.
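
The search just described can be sketched as follows. This is an
illustration under stated assumptions, not the procedure actually
used: the function names are hypothetical, only one strand is
scanned, the operator is trimmed by one base from each end per
step, and both the 17-mer operator and the toy genome below are
invented around the CCGCGGG core and the HIV 353-369 subsequence
quoted elsewhere in the text.

```python
def central_fragment(operator, length):
    """Return the central `length` bases of the operator."""
    start = (len(operator) - length) // 2
    return operator[start:start + length]


def find_matches(genome, probe, max_mismatches=1):
    """Return (position, window, mismatches) for each acceptable window."""
    hits = []
    for i in range(len(genome) - len(probe) + 1):
        window = genome[i:i + len(probe)]
        mismatches = sum(a != b for a, b in zip(window, probe))
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
    return hits


def search_progressively(genome, operator, min_core=7, max_mismatches=1):
    """Try the whole operator, then ever shorter central fragments."""
    for length in range(len(operator), min_core - 1, -2):
        hits = find_matches(genome, central_fragment(operator, length),
                            max_mismatches)
        if hits:
            return length, hits
    return None


# Hypothetical operator and toy genome, for illustration only.
TOY_OPERATOR = "TATCACCGCGGGTGATA"
TOY_GENOME = "TTGACA" + "ACTTTCCGCTGGGGACT" + "AATTGC"

if __name__ == "__main__":
    print(search_progressively(TOY_GENOME, TOY_OPERATOR))
```

A practical search would also scan the complementary strand and
restrict hits to conserved regions, as the text goes on to note.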

Consider matching 6 of 7 bases as the criterion for
choosing a target. The original sequence is acceptable, as
are the 21 (= 7 x 3) sequences that differ by one base.
There are 4^7 = 2^14 = 16384 possible heptamers. Thus we
should expect to find an acceptable match every 16384/22 ≈
745 bases. Similarly, matching 7 of 8 bases should occur
every 65536/25 ≈ 2621 bases; matching 8 of 9 bases should
occur every 262144/28 ≈ 9362 bases. These expected
frequencies are such that viruses, which have genome sizes
ranging from 5 x 10-^ bases up to 10^ bases or more, should
have one or more matches of 6 of 7 bases. Larger viruses
should contain matches of 7 of 8 or even 8 of 9 bases.
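
These expected spacings can be recomputed directly. The sketch
assumes, as the text implicitly does, a random sequence with
equal base frequencies scanned on one strand; the function names
are hypothetical.

```python
def acceptable_sequences(k):
    """Exact match plus every sequence differing at exactly one of k bases."""
    return 1 + 3 * k


def expected_spacing(k):
    """Expected number of bases between acceptable matches of a k-mer."""
    return 4 ** k / acceptable_sequences(k)


if __name__ == "__main__":
    for k in (7, 8, 9):
        print(k, 4 ** k, acceptable_sequences(k), round(expected_spacing(k), 1))
```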

Other criteria may include restricting the search to 
parts of the genome not known to vary among different 
isolates of the organism. 


If the longest matching search sequence is such that
bases known to have no direct contact with the DBP are
assigned to the arms, then we increase the size of T_C to at
least seven and then use a progression of core sequences to
move in a stepwise fashion from a sequence that closely
resembles the operator of the parental DBP to that of the
target. We obtain an acceptable DBP for each target by
variegation and selection. The best DBP from one target
becomes the parental DBP for the next target in the
progression. Accommodating changes in uncontacted bases in
the central core may require variegation of residues in the
protein:protein interface to produce subtle changes in the
tertiary and quaternary structure of the DBP.

To illustrate this process, we consider the target
chosen for Detailed Example 1. The HIV 353-369 target
subsequence ACTTTCCGCTGGGGACT is nucleotides 353 to 369 of
the HIV-1 genome (RATN85), chosen because of the close
match of the central 7 bp of the Kim Consensus sequence
(CCGCGGG) of λ Cro (KIMJ87) to the underscored bases. No
non-variable HIV sequence matched the nine central bases of
any OR3-like sequence. Highly conserved bases of λ OR3 are
written bold with stars above.

             123456789
             * * * * * *
    OR3   5' TATCACCGCAAGGGATA 3'
          3' ATAGTGGCGTTCCCTAT 5'

    HIV-1 5' ACTTTCCGCTGGGGACT 3'
          3' TGAAAGGCGACCCCTGA 5'
             353

T_L, T_C, and T_R are defined, in this case, as the first 5
bases, the center 7 bases (CCGCTGG, underlined), and the
last 5 bases, respectively, of the target subsequence.
T_L-> differs from the corresponding bases of OR3 at four
of five bases, including the strongly conserved A2 and C4.
T_L<- is complementary to T_L->:

    T_L-> = 5' ACTTT 3'
    T_L<- = 3' TGAAA 5'

We create the symmetrized target by rotating T_L<- about
the center of the 17 bp sequence into the same strand as
T_L->:

    T_L-> = 5' ACTTT         AAAGT 3' = T_L<-
    T_L<- = 3' TGAAA               5'

T_R-> differs from the corresponding bases of OR3 at three
of five positions, including the highly conserved A2. We
rotate T_R<- into the same strand as T_R->:

    T_R<- = 5' AGTCC         GGACT 3' = T_R->
            3'               CCTGA 5' = T_R<-

Symmetrized operators derived from HIV 353-369 are:

Left Symmetrized Target:
   5' A C T T T  C C G C T G G  A A A G T 3'
   3' T G A A A  G G C G A C C  T T T C A 5'
        T_L->         T_C         T_L<-

and

Right Symmetrized Target:
   5' A G T C C  C C G C T G G  G G A C T 3'
   3' T C A G G  G G C G A C C  C C T G A 5'
        T_R<-         T_C         T_R->

The two symmetrized derivatives are engineered into the
appropriate vectors so that each of these sequences
regulates the expression of the designed selectable genes
of each of the respective vectors.
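
The construction of the two symmetrized targets amounts to
replacing one arm with the reverse complement of the other. The
following sketch reproduces the Left and Right Symmetrized
Targets from the HIV 353-369 subsequence; the function names are
hypothetical, and equal arm lengths are assumed.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def symmetrized_targets(target, arm_len):
    """Split the target into T_L, T_C, T_R and return (LST, RST)."""
    left = target[:arm_len]
    core = target[arm_len:len(target) - arm_len]
    right = target[len(target) - arm_len:]
    lst = left + core + reverse_complement(left)    # T_L-> T_C T_L<-
    rst = reverse_complement(right) + core + right  # T_R<- T_C T_R->
    return lst, rst


HIV_353_369 = "ACTTTCCGCTGGGGACT"

if __name__ == "__main__":
    lst, rst = symmetrized_targets(HIV_353_369, arm_len=5)
    print("LST:", lst)
    print("RST:", rst)
```

Note that the core is left unchanged, so each symmetrized target
is palindromic in its arms only.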

Had the best match in HIV to the CCGCGGG core been,
for example, CTGCTGG, then we would use the symmetrized
targets shown above until we found acceptable DBPs for the
right and left targets. At that point, we would change the
symmetrized targets to:

Second Left Symmetrized Target:
                   ↓
   5' A C T T T  C T G C T G G  A A A G T 3'
   3' T G A A A  G A C G A C C  T T T C A 5'
        T_L->         T_C         T_L<-

and

Second Right Symmetrized Target:
                   ↓
   5' A G T C C  C T G C T G G  G G A C T 3'
   3' T C A G G  G A C G A C C  C C T G A 5'
        T_R<-         T_C         T_R->

Using these targets and the selected right and left DBPs as
new parental DBPs, we would initiate a new round of
variegation and selection.


As another example, consider the 14 bp 434 operator.
We could take each arm as 4 bp and the central core as 6
bp. We are likely to find good matches to the 6 bp core in
any genome larger than 4096 bases. Thus, this division of
the operator is preferred over that which assigns 5 bp to
each arm and 4 bp to the core.

In order to obtain proteins that bind to these
symmetrized targets, we generate a population of potential
dbp genes by synthesizing DNA that codes on expression for
part or all of a potential DBP and having variegated bases
in the codons that encode residues of the parental DBP that
are thought to contact the DNA or that influence the
detailed position or dynamics of residues that contact the
DNA. The variegation in the chosen codons, embodied in
the synthetic DNA, is transferred to the pdbp gene either by
replacement of a cassette or by annealing a mutagenic
oligonucleotide to ssDNA.

The pdbp gene may be part of the vector that carries
the selectable binding marker genes or may be separate.
Two sets of selectable binding marker genes are prepared,
one carrying the Right Symmetrized Targets (RST) and one
carrying the Left Symmetrized Targets (LST). If the pdbp
gene is on a different vector from the selectable binding
marker genes, then RST and LST selection strains are
prepared. A highly variegated population of pdbp genes is
delivered into cells that also contain one of: a) the
RST-containing selectable genes, or b) the LST-containing
selectable genes.

The two sets of transformed cells are selected for
vector uptake and successful repression at low stringency
of selection. In the case described in Detailed Example 1,
cells containing DBPs will be Tc^S, Fus^R, and Gal^R.

After one or more variegation steps, DBPs that bind
tightly and specifically to each of the Left Symmetrized
and Right Symmetrized Targets are obtained. These DBPs are
designated, in general terms, DBP_L and DBP_R, respectively.
If these proteins are produced in equal amounts in the same
cell, then approximately 50% of DBP protein dimers consist
of the DBP_L:DBP_R heterodimer. This may be sufficient for
repression of the target. In the preferred embodiment,
further mutations are introduced into the DBP_L and DBP_R
proteins, as described below, to enable 100% of the
molecules to form heterodimers.
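
The 50% figure is what one would expect if the two kinds of monomer pair at random. The following is a minimal sketch of that arithmetic, assuming random association with no pairing preference; the function name dimer_fractions is illustrative and does not come from the source.

```python
def dimer_fractions(frac_l):
    """Return expected dimer fractions for random pairing of two monomer types.

    frac_l is the fraction of monomers that are DBP_L; the rest are DBP_R.
    Assumption (not from the source): monomers associate with no preference.
    """
    if not 0.0 <= frac_l <= 1.0:
        raise ValueError("frac_l must lie between 0 and 1")
    frac_r = 1.0 - frac_l
    return {
        "L:L": frac_l * frac_l,        # homodimer of DBP_L
        "L:R": 2.0 * frac_l * frac_r,  # heterodimer (two equivalent orderings)
        "R:R": frac_r * frac_r,        # homodimer of DBP_R
    }


if __name__ == "__main__":
    for name, value in dimer_fractions(0.5).items():
        print(name, value)
```

Under this assumption, equal amounts of the two monomers give one quarter of each homodimer and one half heterodimer; unequal expression lowers the heterodimer fraction, which is one motivation for the obligate-heterodimer mutations described in the text.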

In an especially preferred embodiment, variegation of
the gene to alter its DNA specificity is combined with
variegation of the gene to alter its dimerization
(protein-protein binding) characteristics, so that the
formation of the heterodimer DBP_L:DBP_R is favored. The
variegation of the dimerization interface may precede
(strategy 3) or follow (strategy 5) the alteration of the
DNA specificity. Simultaneous variegation at both sites is
also possible.

The DNA-binding proteins considered here interact
with specific DNA sequences as multimers (usually dimers or
tetramers) (PABO84). Monomers usually associate
independently and the resulting multimer interacts with DNA.
Coupling between oligomerization and DNA-binding equilibria
results in explicit inclusion of oligomerization effects in
the apparent affinity of DNA-binding proteins for their
operators (JOHN80, RIGG70, and CHAD71).

The precise geometry of the protein in the complex
with DNA strongly influences the strength of the
interaction with DNA. For example, Sauer et al. generated a 92
amino-acid fragment of λ repressor carrying the YC88
mutation. This N-terminal domain dimerizes through a
covalent S-S bond. Although the dissociation into monomers
is immeasurable, the binding to DNA is diminished about
10-fold relative to intact λ repressor (SAUE86).

The results presented by Reidhaar-Olson and Sauer
(REID88), summarized in Table 7, show which residues in
the dimerization region, when varied, will produce
functional homodimers of N-terminal domains with little
alteration of structure. Wide variation is tolerated at
solvent-exposed positions 85, 86, and 89. In contrast,
almost no substitutions are tolerated at the buried
positions 84 and 87. Most hydrophobic residues are
functional at position 91 (except P), although
aromatic residues are excluded. (We use the single-letter
code for amino acids; AUSU87, Appendix A.) The hydrophobic
interactions among I84, M87, and V91 had previously been
shown to be major components of dimerization free energy
(NELS83, WEIS87b). In general, mutations that destabilize λ
repressor N-terminal dimerization are similar to those that
destabilize global protein structure.

The P22 Mnt repressor, like λ Cro, is a small
protein containing both DNA-binding and oligomerization
sites. Unlike Cro, P22 Mnt is a tetramer in solution
(VERS85b, VERS87a). The amino acid sequence of Mnt has
been determined (VERS87a) but the three-dimensional
structure of the protein is not known. Knight and Sauer
(KNIG88) have shown, by sequential deletion of C-terminal
residues, that Y78 is essential for tetramer formation.

A preferred embodiment of this process utilizes
information available on protein structure obtained from
crystallographic, modeling, and genetic sources to predict
the residues at which mutation results in stable protein
monomers that retain substantially the same 3D structure as
the wild-type DBP, but that fail to form dimers.
Dimerization mutants are constructed using site-directed
mutagenesis to isolate one or more user-specified
substitutions at chosen residues. The process starts using
one of the genes selected for binding to a symmetrized
target, denoted dbp_1 (dbp_1 could be either the dbp_L gene
or the dbp_R gene), as the parental sequence, so that each
of several specific mutations is engineered into the gene
for a protein binding specifically to the symmetrized target
used in the selection (the Left Symmetrized Target in the
case of the dbp_L gene).

Reverse selection isolates cells not expressing a
protein that binds to the target DNA sequence. This
phenotype could arise in several ways, including: a) a
mutation or deletion in the dbp_1 gene so that no protein is
produced, b) a mutation that renders the descendant of the
parental DBP_1 unstable, c) a mutation that allows the
descendant of the parental DBP_1 to persist and to fold into
nearly the same 3D structure as the parental DBP, but which
prevents oligomerization. It is anticipated that reverse
selection will isolate many genes for non-functional
proteins and that these proteins must be analyzed until a
suitable oligomerization mutant is found. Therefore, we
choose sites carefully so that we maximize the chance of
disrupting oligomerization without destroying tertiary
structure. We also use lower levels of variegation in
reverse selection so that the number of mutants to be
analyzed is not too large. For forward selection, the
number of different mutants is preferably 10^ to 10^, and
more preferably greater than 10^. For reverse selection,
it is 10^ to 10^. (Under certain circumstances, the number
of reverse selection mutants could be as low as 10-20.)

Cassettes bearing the site-specific changes are
synthesized and each is ligated into the vector at the
appropriate site in the dbp_1 gene. Transformants are
obtained by the antibiotic-resistance selection for vector
maintenance (e.g. Ap^R), and screened for loss of repression
of the selective systems under control of DBP_1 binding.
Defective dimerization results in substantially decreased
DNA affinity, hence the altered derivatives are recognized
by screening isolates obtained using the selectable gene
systems. In Detailed Example 1 (where dbp_1 is dbp_L),
dimerization-defective derivatives are Tc^R, Fus^S and Gal^S
in E. coli delta4 cells (Gal+ in cells of E. coli strain
HB101). Appropriate controls are used to verify that the
loss of repression is due to a substitution in dbp_1.

In Detailed Example 1 (using an engineered synthetic λ
cro gene designated rav), the mutant Rav protein with
specific binding to the Left Symmetrized Target (designated
Rav_L; gene, rav_L) is used to produce a derivative defective
in dimerization. Studies of λ Cro suggest that the dimer
is stabilized by interactions in an antiparallel beta sheet
between residues E54, V55 and K56 from each monomer
(ANDE81, PABO84). In addition, F58 appears to stabilize
the Cro dimer through hydrophobic interactions between F58
of one monomer and residues in the hydrophobic core of the
other monomer (TAKE85). Further, mutational studies
(PAKU86) show that some substitutions at E54 and at F58
result in decreased intracellular specific protein levels
and that these mutant proteins lack repressor activity.
Mutants are constructed by using site-specific mutagenesis
to isolate VF55 and FW58 mutants of Rav_L. (Point mutations
are written as XYnnn, where X is the amino acid found at
location nnn and Y is the amino acid found in the mutant.)
The cassettes bearing mutations that confer the VF55 and
FW58 substitutions are synthesized, and each is ligated
into the operative vector at the appropriate site within
the rav_L gene. Selections and characterizations are as
described above. These alleles are designated rav_L-55 and
rav_L-58.

Alternative methods of obtaining dimerization-defective
DBP derivatives are not excluded. Thus the rav+ gene
(Detailed Example 1), or a potential-dbp+ gene coding for
any globular dimerizing protein, may be subjected to
Structure-directed Mutagenesis of residues involved in the
protein-protein dimer interface. In the case of the rav+
allele, residues 7, 23, 25, 30, 33, 40, 42, 52, 54, 55 and
58 are candidates for mutagenesis.
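
The XYnnn point-mutation notation used in this section (for example VF55 and FW58) is regular enough to parse mechanically. The sketch below is only an illustration of the convention as stated in the text, in which X is the parental residue, Y is the mutant residue, and nnn is the position; the names PointMutation and parse_point_mutation are hypothetical.

```python
import re
from typing import NamedTuple

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

_MUTATION_RE = re.compile(
    r"^([%s])([%s])(\d+)$" % (AMINO_ACIDS, AMINO_ACIDS)
)


class PointMutation(NamedTuple):
    parental: str   # X: residue found at the position in the parent
    mutant: str     # Y: residue found at the position in the mutant
    position: int   # nnn: residue number


def parse_point_mutation(label):
    """Parse a label such as 'VF55' into its three components."""
    match = _MUTATION_RE.match(label.strip())
    if match is None:
        raise ValueError("not in XYnnn form: %r" % label)
    parental, mutant, position = match.groups()
    return PointMutation(parental, mutant, int(position))


if __name__ == "__main__":
    for label in ("VF55", "FW58", "YC88"):
        print(label, parse_point_mutation(label))
```

Read this way, VF55 replaces valine 55 with phenylalanine and FW58 replaces phenylalanine 58 with tryptophan.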

For example, mutagenesis of rav+ residues 52, 54, 55
and 58, using a cassette carrying vgDNA at codons
specifying these residues, is followed by ligation and
transformation of cells. Selection is applied for plasmid
maintenance (Ap^R) and loss of repression (Tc^R, and galactose
utilization in HB101 cells). Variegated plasmid DNA is
purified from a population of Ap^R Tc^R Gal+ cells and
analyzed with restriction enzymes and Southern blotting.
Plasmid preparations containing the vg-rav fragment of
predominantly rav+ molecular weight are retained, and are
designated vg-ravA.

To isolate a second dimer-specific rav mutant protein,
designated RavB, such that the mutation in ravB is
complementary to a mutation contained in the vg-ravA
population, Structure-directed Mutagenesis is performed on a
second copy of the rav gene, designated ravB, carried on a
plasmid conferring a different antibiotic resistance (e.g.
Km^R). Residues affecting the same dimer interface are varied.
Competent vg-ravA cells are transformed with the vg-ravB
plasmid preparation. Transformants are obtained as Ap^R
Km^R, and further selected for Rav+ phenotype using the
selection systems (Tc^S, Fus^R, Gal^R in an E. coli delta4
cell genetic background).

The surviving colonies are analyzed by restriction
analysis of plasmids, and are backcrossed to obtain pure
plasmid lines that confer each of the Ap^R and Km^R
phenotypes. In this manner, mutants bearing obligate
complementing dimerization alleles of ravA and ravB are isolated.
These rav mutations may be tested pairwise to confirm
complementation, and are sequenced. The information
obtained from these mutants is used to introduce these
dimerization mutations into rav_L and rav_R genes previously
altered by Structure-directed Mutagenesis in DNA-binding
specificity domains as described above.

In one preferred embodiment of this invention
(strategy 3), isolation of dbp_L and dbp_R mutations that confer
specific and tight binding to target DNA sequences (the Left
and Right Symmetrized Targets, respectively) is followed by
engineering of second-site mutations causing a dimerization
defect, for example dbp_L-1 as described herein.
Complementing mutations are introduced into each of the dbp_R
and dbp_L genes, such that obligate heterodimers are
co-synthesized and folded together in the same cell and bind
specifically to the non-palindromic targets.

A primary set of residues is identified. These
residues are predicted, on the basis of crystallographic,
modeling, and genetic information, to make contacts in the
dimer with the residue altered to produce DBP_L-1. A
secondary set of residues is chosen, whose members are
believed to touch or influence the residues of the primary
set. An initial set of residues for Focused Mutagenesis in
the first variegation step is selected from residues in the
primary set. A variegation scheme, consistent with the
constraints described herein, is picked for these residues
so that the chemical properties of residues produced at
each variegated codon are similar to those of the wild-type
residue; e.g. hydrophobic residues go to hydrophobic or
neutral, charged residues go to charged or hydrophilic. A
cassette containing the vgDNA at the specified codons is
synthesized and ligated into the dbp_R gene carried in a
vector with a different antibiotic selection than that on
the vector carrying the dbp_L-1 gene. For example, in
Detailed Example 1, rav_L-55 or rav_L-58 are encoded on
plasmids that carry the gene for Ap^R. Variegated rav_R
genes are cloned into plasmids bearing the gene for Km^R.

The protein produced by the dbp_L-1 allele carrying
the dimerization mutation fails to bind to the dbp_L target
(the Left Symmetrized Target). Cells bearing this target as
a regulatory site upstream of selection genes display the
DBP- phenotype. This phenotype is employed to select
complementary mutations in a dbp_R gene. Following ligation
of the mutagenized cassette into the appropriate (e.g. Km^R)
plasmid, cells bearing dbp_L-1 on a differently marked (e.g.
Ap^R) plasmid are made competent and transformed.
Transformants that have maintained the resident plasmid (e.g.
Ap^R Km^R) are further selected for DBP+ phenotype in which
binding of the non-palindromic target subsequence
T_L-T_C-T_R is required for repression. Only heterodimers
consisting of a 1:1 complex of the dbp_L-1 gene product and the
complementing dbp_R-1 gene product bind to the target, and
produce Tc^S Fus^R Gal^R colonies in the appropriate cell
host.

Each resident plasmid is obtained from candidate
colonies by plasmid preparation and transformation at low
plasmid concentration. Strains carrying plasmids encoding
either mutant dbp_L-1 or mutant dbp_R-1 genes are selected
by the appropriate antibiotic resistance (e.g. in Detailed
Example 1, Ap^R or Km^R selection, respectively). Plasmids
are independently screened for the DBP- phenotype,
characterized by restriction digestion and agarose gel
electrophoresis, and plasmid pairs are co-tested for
complementation by restoration of the DBP+ phenotype with
respect to the T_L-T_C-T_R target when both dbp alleles are
present intracellularly. Successfully complementing pairs of
dbp genes are sequenced. Subsequent variegation steps may be
required to optimize dimer interactions or DNA binding by
the heterodimer.

In Detailed Example 1, the VF55 change in Rav_L-55
introduces a bulky hydrophobic side group in place of a
smaller hydrophobic residue. A complementary mutation
inserts a very small side chain, such as G55 or A55, in a
second copy of the protein. In this case, the primary set
for mutagenesis is V55. A secondary set of residues
includes nearby components of the beta strand E54, K56,
P57, and E53 as well as other residues. In the initial
variegation step, residues 53-57 are subjected to Focused
Mutagenesis, such that all amino acids are tested at these
locations. Cells containing complementing mutant proteins
are selected by requiring repression of the non-palindromic
HIV 353-369 target subsequence T_L-T_C-T_R.

In another embodiment, the rav_L-58 allele, carrying
the substitution FW58 and conferring the Rav- phenotype, is
used for selection of complementing mutations following
Structure-directed Mutagenesis of the rav_R gene. Residues
L7, L23, V25, A33, I40, L42, A52, and G54 are identified
as the primary set. Residues in the secondary set
include F58, P57, P59, and buried residues in alpha helix
1. In the initial variegation step, residues 23, 25, 33,
40, and 42 are varied through all twenty amino acids.
Subsequent iterations, if needed, include other residues of
the primary or secondary set. In this manner, rav_R-1, -2,
-3, etc. are isolated, each of which yields a protein that
is an obligate complement of the rav_L-58 mutation.
Selection for Rav+ phenotype using the HIV 353-369 target
T_L-T_C-T_R sequence is used as described in the preferred
embodiment.
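
Varying five residues through all twenty amino acids defines a fixed number of distinct protein sequences, which gives a feel for how large the variegated population has to be. The following is a minimal counting sketch; the function name protein_variants is illustrative, and the count covers amino-acid sequences only, not the larger number of DNA sequences that encode them.

```python
def protein_variants(n_positions, alphabet_size=20):
    """Number of distinct protein sequences when n_positions residues
    are each allowed to take any of alphabet_size amino acids."""
    if n_positions < 0 or alphabet_size < 1:
        raise ValueError("n_positions must be >= 0 and alphabet_size >= 1")
    return alphabet_size ** n_positions


if __name__ == "__main__":
    # Five fully varied positions, e.g. residues 23, 25, 33, 40, and 42.
    print(protein_variants(5))
```

Five fully varied positions give 20^5 = 3,200,000 possible amino-acid sequences, so each additional fully varied residue multiplies the required population twentyfold.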

In either the preferred or alternative embodiments,
this process teaches a method of constructing obligate
complementing mutations at an oligomer interface. These
pairs of mutations may be used in further embodiments to
engineer novel DBPs specific for HIV 681-697 and HIV
760-776 targets; for targets in other pathogenic retroviruses
such as HTLV-II; for other viral DNA-containing pathogens
such as HSV-I and HSV-II; as well as for non-viral targets
such as deleterious human genes. Similar methodology is
claimed for engineering DBPs for use in animal and plant
systems.

Overview: Selection of the Initial DNA-Binding Protein
for Variegation

The choice of an initial DBP is determined by the
degree of specificity required in the intended use of the
successful DBP and by the availability of known DBPs. The
present invention describes three broad alternatives for
producing DBPs having high specificity and tight binding to
target DNA sequences. The present invention is not limited
to these classes of initial potential DBPs.

A first alternative is to use a polypeptide that will
conform to the DNA and can wind around the DNA and contact
the edges of the base pairs. A second alternative is to
use a globular protein (such as a dimeric H-T-H protein)
that can contact one face of DNA in one or more places to
achieve the desired affinity and specificity. A third
alternative is to use a series of flexibly linked small
globular domains that can make contact with several
successive patches on the DNA.

DNA features influencing choice of an initial DBP:

Features of DNA that influence the choice of an
initial DBP include sequence-specific DNA structure and the
size of the genome within which the DBP is expected to
recognize and affect gene expression.

Sequence-specific aspects of DNA structure that can
influence protein binding include: a) the edges of the
bases exposed in the major groove, b) the edges of the
bases exposed in the minor groove, c) the equilibrium
positions of the phosphate and deoxyribose groups, d) the
flexibility of the DNA toward deformation, and e) the
ability of the DNA to accept intercalated molecules. Note
that the sequence-specific aspects of DNA are carried
mostly inside a highly charged molecular framework that is
nearly independent of sequence.

The strongest signals of sequence are found in the
edges of the base pairs in the major groove, followed by
the edges in the minor groove. The groove dimensions
depend on local DNA sequence (NEID87b, KOUD87, ULAN87).

The number of base pairs required to define a unique
site depends on the size and non-randomness of the genome.
Consider a genome of length Z_g bases and consider a
specific subsequence of length Q. If the genome is random,
the subsequence is expected to occur N(Q) times, where

$$N(Q) = \frac{2 Z_g}{4^{Q}} = \frac{2 Z_g}{2^{2Q}}$$

From this equation, we derive the expression Q_u, which is
the lower limit of the length of subsequences that are
expected to occur once or be absent:

$$Q_u = \frac{\log_2(2 Z_g)}{2}$$

Z_g          log2(2 Z_g)/2          Q_u

Thus, a DNA subsequence comprising 12 base pairs may be
unique in the E. coli genome (5 x 10^6 bp), but is likely to
occur about 180 times in a random sequence the size of the
human genome (3 x 10^9 bp).
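
The two expressions above can be evaluated directly. The following is a small sketch under the random-genome assumption of the text; the function names and the strands parameter are illustrative additions, with strands=2 reproducing the factor of 2 in N(Q) and Q_u as written.

```python
import math


def expected_occurrences(genome_bp, q, strands=2):
    """N(Q): expected count of one specific Q-mer in a random genome.

    strands=2 corresponds to the factor of 2 in the formula as written;
    strands=1 is included only for comparison.
    """
    return strands * genome_bp / 4 ** q


def min_unique_length(genome_bp, strands=2):
    """Q_u: length at which N(Q) falls to one, log2(strands * Z_g) / 2."""
    return math.log2(strands * genome_bp) / 2


if __name__ == "__main__":
    for name, size in (("E. coli", 5e6), ("human", 3e9)):
        q_u = min_unique_length(size)
        print(name, "Q_u =", round(q_u, 2), "->", math.ceil(q_u), "bp")
    print("N(12), human, two strands:", round(expected_occurrences(3e9, 12)))
    print("N(12), human, one strand: ", round(expected_occurrences(3e9, 12, strands=1)))
```

For 5 x 10^6 bp the sketch gives Q_u of about 11.6, consistent with 12 bp being potentially unique in E. coli. For 3 x 10^9 bp it gives Q_u of about 16.2. With the factor of 2 as written, N(12) for the human-sized genome comes to roughly 358; the figure of about 180 quoted in the text matches the same calculation without that factor.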

The non-random nature of DNA sequences in genomes has
been shown to result in the over- and under-representation
of specific sequences. The random-genome model can
underestimate the probe length needed to define a unique coding
sequence (LATH85). Recognition sites for certain
restriction enzymes occur in clusters and are found much more
often than expected (SMIT87). In contrast, lac repressor
binding sites in eukaryotic genomes are almost two orders
of magnitude less frequent than expected on the basis of
random sequence (SIMO84).

Protein features influencing choice of initial DBP:

Sequence-specific binding to DNA by DBPs does not
require unpairing of the bases. Most sequence-specific
binding by proteins to DNA is thought to involve contacts
in the DNA major groove.

To be certain of unique recognition in the human
genome, it is best to design a protein that recognizes 19
to 21 base pairs. To contact 20 base pairs directly, a
protein would need to: a) wind two full turns around the
DNA making major groove contacts, b) make a combination of
major groove and minor groove contacts, or c) contact the
major groove at four or five places. An extended
polypeptide, binding in the major groove of B-DNA, lies about
5.0 Å from the DNA axis. One base pair and 1 1/2 amino acids
extend roughly equal distances along the helix (SAEN83,
p238).

A nine-residue alpha helix, such as the recognition
helices of H-T-H repressors, extends about 13.5 Å along the
major groove. If residues with long side chains are
located at each terminus of the helix, the helix can make
contacts over a 20.0 Å stretch of the major groove, allowing
six base pairs to be contacted. Parts of the DBP other
than the second helix of the H-T-H motif can make
additional protein-DNA contacts, adding to specificity and
affinity. The rigidity of the alpha helix prevents a long
helix from following the major groove around the DNA. A
series of small domains, appropriately linked, could wind
around DNA, as has been suggested for the zinc-finger
proteins (BERG88a, GIBS88, FRAN88). In an extended
configuration a polypeptide chain progresses roughly 3.2 to
3.5 Å between consecutive residues. Thus, a 10-residue
extended protein structure could contact 5 to 8 bases of
DNA.
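
The lengths quoted above follow from simple per-residue distances. The sketch below reproduces that arithmetic. The 1.5 Å rise per residue of an alpha helix is the standard value and agrees with the 13.5 Å quoted for nine residues. The 3.2 to 3.5 Å extended-chain spacing and the ratio of 1 1/2 amino acids per base pair are taken from the text. The constant and function names are illustrative, and the base-pair estimate is only a rough guide.

```python
ALPHA_HELIX_RISE_A = 1.5          # Å per residue along an alpha helix
EXTENDED_RISE_A = (3.2, 3.5)      # Å per residue, extended chain (from the text)
RESIDUES_PER_BASE_PAIR = 1.5      # extended chain in the major groove (SAEN83)


def helix_length(n_residues):
    """Approximate axial length in Å of an alpha helix of n_residues."""
    return n_residues * ALPHA_HELIX_RISE_A


def extended_length_range(n_residues):
    """Approximate (min, max) contour length in Å of an extended chain."""
    low, high = EXTENDED_RISE_A
    return (n_residues * low, n_residues * high)


def base_pairs_spanned(n_residues):
    """Rough number of base pairs an extended chain of n_residues can follow."""
    return n_residues / RESIDUES_PER_BASE_PAIR


if __name__ == "__main__":
    print("9-residue helix:", helix_length(9), "Å")
    print("10-residue extended chain:", extended_length_range(10), "Å")
    print("base pairs for 10 residues:", round(base_pairs_spanned(10), 1))
    print("base pairs for 24 residues:", round(base_pairs_spanned(24), 1))
```

Ten extended residues map to roughly 6.7 base pairs, inside the 5 to 8 range given above.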

Stable complexes of proteins with other macromolecules
involve burial of 1000 Å^2 to 3000 Å^2 of surface area on
each molecule. For a globular protein to make a stable
complex with DNA, the protein must have substantial surface
that is already complementary to the DNA surface or can be
deformed to fit the surface without loss of much free
energy. Considering these modalities we assign each
genetically encoded polypeptide to one of three classes:

1) a polypeptide that can easily deform to complement the
shape of DNA,

2) a globular protein, the internal structure of which
supports recognition elements to create a surface
complementary to a particular DNA subsequence, and

3) a sequential chain of globular domains, each domain
being more or less rigid and complementary to a
portion of the surface of a DNA subsequence and the
domains being linked by amino acid subsequences that
allow the domains to wind around the DNA.

Complementary charges can accelerate association of
molecules, but they usually do not provide much of the free
energy of binding. Major components of binding energy arise
from highly complementary surfaces and the liberation of
ordered water on the macromolecular surfaces.

Properties of sequence-specific DNA-binding by
polypeptides:

An extended polypeptide of 24 amino acids lying in
the major groove of B-DNA could make sequence-specific
interactions with as many as 15 base pairs, which is about
the least recognition that would be useful in eukaryotic
systems. Peptides longer than 24 amino acids can contact
more base pairs and thus provide greater specificity.

Extended polypeptide segments of proteins bind to DNA
in natural systems (e.g. λ repressor and Cro, P22 Arc and
Mnt repressors). The DNA major groove can accommodate
polypeptides in either helical or extended conformation.
Side groups of polypeptides that lie in the major groove
can make sequence-specific or sequence-independent
contacts. Since the polypeptide can lie entirely within the
major groove, contacts with the phosphates are allowed but
not mandatory. Thus a polypeptide need not be highly
positively charged. A neutral or slightly positively
charged polypeptide might have very low non-specific
binding.

Polypeptides composed of the 20 standard amino acids
are not flat enough to lie in the minor groove unless the
sequence contains an extraordinary number of glycines;
however, residue side-groups could extend into the minor
groove to make sequence-specific contacts. Polypeptides of
more than 50 amino acids may fold into stable 3D
structures. Unless part of the surface of the structure is
complementary to the surface of the target DNA subsequence,
formation of the 3D structure competes with DNA binding.
Thus polypeptides generated for selection of specific
binding are preferably 25 to 50 amino acids in length.

Polypeptides present the following potential
advantages:

a) low molecular weight: an extended polypeptide offers
the maximum recognition per amino acid,

b) polypeptides have no inherent dyad symmetry and so are
not biased toward recognition of palindromic
sequences,

c) polypeptides may have greater specificity than
globular proteins, and

d) peptides may be good models from which other low
molecular weight compounds may be designed.

Thus, one would choose a polypeptide as initial
DNA-binding molecule if high specificity and low molecular
weight are desired.

No sequence-specific DNA-binding by small
polypeptides has been reported to date. Possible reasons that
such polypeptides have not been found include: a) no one
has sought them, b) cells degrade polypeptides that are
free in the cytoplasm, and c) they are too flexible and are
not specific enough.

In a preferred embodiment, a DNA-binding polypeptide
is associated with a custodial domain to protect it from
degradation, as discussed more fully in Examples 3 and 4.

Properties of globular proteins influencing choice of
initial DBP:

The majority of the well-characterized DBPs are small
globular proteins containing one or more DNA-binding
domains. No single-domain globular protein comprising 200
or fewer amino acids is likely to fold into a stable
structure that follows either groove of DNA continuously
for 10 bases. The structure of a small globular protein
can be arranged to hold more than one set of recognition
elements in appropriate positions to contact several sites
along the DNA, thereby achieving high specificity; however,
the bases contacted are not necessarily sequential on the
DNA. For example, each monomer of λ repressor contains two
sequence-specific DNA recognition regions: the recognition
helix of the H-T-H region contacts the front face of the
DNA binding site and the N-terminal arm contacts the back
face. To obtain tight binding, a globular protein must
contact not only the base-pair edges, but also the DNA
backbone, making sequence-independent contacts. These
sequence-independent contacts give rise to a certain
sequence-independent affinity of the protein for DNA. The
bases that intervene between segments that are directly
contacted influence the position and flexibility of the
contacted bases. If the DNA-protein complex involves
twisting or bending the DNA (e.g. 434 repressor-DNA
complex), non-contacted bases can influence binding through
their effects on the rigidity of the target DNA sequence.

The phage repressors Arc, Mnt, λ repressor and Cro are
proposed to bind to DNA at least partly via binding of
extended segments of polypeptide chain. The N-terminal arm
of λ repressor makes sequence-specific contacts with bases
in the major groove on the back side of the binding site.
The C-terminal "tail" of λ Cro is proposed to make
sequence-independent contacts in the minor groove of the
DNA. The structure of neither Arc nor Mnt has been
determined; however, the sequence specificity of the
N-terminal arm of Arc can be transferred to Mnt; viz. when
Arc residues 1-9 are fused to Mnt residues 7 through the
C-terminus, the fusion protein recognizes the arc operator
but not the mnt operator. Residues 2, 3, 4, 5, 8, and 10 of
Arc have been proposed to contact operator DNA and residue
6 of Mnt has been shown to be involved in sequence-specific
operator contacts.

Binding to non-palindromic sequences requires
alteration of dyad-symmetric proteins. Even non-palindromic DNA
has approximate dyad symmetry in the deoxyribophosphate
backbone; proteins that are heterodimers or pseudo-dimers
engineered from known globular DBPs are good candidates for
the mutation process described here to obtain globular
proteins that bind non-palindromic DNA. It has been
observed that the DNA restriction enzymes having
palindromic recognition are composed of dyad-symmetric multimers
(MCCL86), while restriction enzymes and other DNA-modifying
enzymes (e.g. Xis of phage λ) having asymmetric recognition
are composed of a single polypeptide chain or an
asymmetric aggregate (RICH88). Such proteins may also provide
reasonable starting points to generate DBPs recognizing
non-palindromic sequences.

A globular protein can bind sequence-specifically to
DNA through one set of residues and activate transcription
from an adjacent gene through a different set of residues
(for example, λ or P22 repressors). The internal structure
of the protein establishes the appropriate geometric
relationship between these two sets of residues. Globular
proteins may also bind particular small molecules,
effectors, in such a way that the affinity of the protein for
its specific DNA recognition subsequence is a function of
the concentration of the particular small molecules (e.g.
CRP and [cAMP]). Conditional DNA-binding and gene
activation are most easily obtained by engineering changes into
known globular DBPs.

Some DBPs from bacteria and bacteriophage have been
shown to have sufficient specificity to operate in
mammalian cells.

An initial DBP may be chosen from natural globular
DBPs of any cell type. The natural DBP is preferably
small so that genetic engineering is facile. Preferably,
the 3D structure of the natural DBP is known; this can be
determined from X-ray diffraction, NMR, genetic and
biochemical studies. Preferably, the residues in the
natural DBP that contact DNA are known. Preferably, the
residues that are involved in multimer contacts are known.
Preferably, the natural operator of the natural DBP is
known. More preferably, mutants of the natural operator
are known and the effects of these mutants on binding by
natural DBP and mutant DBPs are known. Preferably,
mutations of the DBP are known and the effects on protein
folding, multimer formation, and in vivo half-life are
known. Most of the above data are available for λ Cro, λ
repressor and fragments of λ repressor, 434 repressor and
Cro proteins, E. coli CRP and trp repressor, P22 Arc, and
P22 Mnt.

Globular DBPs are the best understood DBPs. In many
cases, globular DBPs are capable of sufficient specificity
and affinity for the target DNA sequence. Thus globular
DBPs are the most preferred candidates for initial DBP.
Table 8 contains a list of some preferred globular DBPs for
use as initial DBPs.

λ repressor and phage 434 repressor have been
extensively studied (CHAD71, PTAS80, PABO79, JOHN79, SAUE79,
SAUE86, PABO82a,b, LEWI83, OHLE83, WEIS87a,b,c, REID88,
ANDE87, NELS86, ELIA85). Both proteins comprise an
amino-terminal DNA-binding domain having four homologous alpha
helices. Helices 2 and 3 form the H-T-H motif. DNA
contacts originate in helix 2, helix 3, and adjacent
regions, with helix 3 providing most of the contacts. The
N-terminal domains of λ repressor contact each other along
helix 5 (PABO82b), while in 434 repressor the interdomain
contacts are beyond helix 4, there being no helix 5
(ANDE87).

The operator DNA bends symmetrically in the 434
repressor-consensus operator co-crystal (ANDE87). The
center of the 14 base pair DNA helix is over-wound and
bends slightly along its axis such that it curls around the
alpha 3 helix of each repressor monomer; the ends of the
operator DNA helix are underwound. Bending of operator DNA
has also been proposed in models of Cro protein and CAP
protein operator binding (OHLE83, GART88). Consistent with
the results of Gartenberg and Crothers, bending of the 434
operator toward Cro is toward the minor groove and occurs
most readily when the central bases consist exclusively of
A and T (KOUD87); in this case, substitution of CG base
pairs greatly reduces binding.

λ Cro (TAKE77) has been described from an X-ray
structure of the protein without DNA (ANDE81). Alpha helix
2 lies across the operator major groove and may make
contacts to operator backbone phosphates at its N-terminal
and C-terminal ends. In addition, backbone phosphates may
be contacted by residues at the C terminus of alpha 3, N
terminus of beta 2, and C terminus of beta 3 (PABO84). In
computer model building of λ Cro-operator DNA interactions,
bending of operator DNA or bending at the monomer-monomer
interface of the Cro dimer has been proposed to make the
best fit between operator and dimer (PABO84).

Key amino acids within the H-T-H region of 434 Cro and
λ Cro are highly conserved (PABO84), and 434 Cro binds
operator DNA as a dimer (WHAR85a). Because the crystals of
434 Cro and DNA do not diffract to high resolution, atomic
details of the protein-DNA interactions are not revealed
(WOLB88). Nevertheless, Wolberger et al. report very
significant similarities and differences between the DNA
binding patterns of 434 repressor and 434 Cro. These
observations on DBPs from 434, together with recent results
on Trp repressor (OTWI88), support the view that a)
structural elements that fit into the major groove of DNA
can function in a variety of closely related ways, b)
bending of DNA complexed to proteins is an important
determinant of specificity, and c) mechanisms of
recognition may be quite subtle.

Crystal structures have been determined for two DBPs,
CRP (WEBE87a) and TrpR (OTWI88) from E. coli. Both these
proteins contain H-T-H motifs and bind their cognate
operators only when particular effector molecules are bound
to the protein, cAMP for CRP and L-tryptophan for TrpR.
Binding of each effector molecule causes a conformational
change in the protein that brings the DNA-recognizing
elements into correct orientation for strong, sequence-
specific binding to DNA (JOHN86). The DNA-binding function
of Lac repressor is also modulated through protein binding
of an effector molecule (e.g. lactose); unlike CRP and
TrpR, Lac repressor binds DNA only in the absence of the
effector. CRP can act either as an activator (RENYSS) or 
as a repressor (P0LA8&) depending on the relationship 
between the CRP-binding site and the rest of the promoter. 

Two structures of CRP (MCKA81, MCKA82) and one
structure of a CRP mutant (WEBE87a) are available.
Otwinowski et al. (OTWI88) have published an X-ray crystal
structure of TrpR bound to the Trp operator. This struc-
ture shows that, although TrpR contains a canonical H-T-H
motif, the positioning of the recognition helix with
respect to the DNA is quite different from the positioning
of the corresponding helix in other H-T-H DBPs (MATT88) for
which structures of protein-DNA complexes are available.
Unlike previously determined structures, most of the
interactions between atoms of TrpR and bases are mediated
by localized water molecules. It is not possible to
distinguish between localized water and atomic ions, such
as Na+, by X-ray diffraction alone. We shall follow
Otwinowski et al. and refer to these peaks in electron
density as water, although ions cannot be ruled out.

Bass et al. (BASS88) studied the binding of wild type
TrpR and single amino acid missense mutants of TrpR to a




consensus palindromic Trp operator and to palindromic
operators that differ from the consensus by a symmetric
substitution at one base in each half operator. Bass et
al. conclude that the contact between the H-T-H motif of
TrpR and the operators must be substantially different from
the model that had been built based on the 434 Cro-DNA
structure.

Thus the binding of globular DBPs that are modulated
by effector molecules is fundamentally the same as the
binding of unmodulated globular DBPs, but the details of
each protein's interactions with DNA are quite different.
Prediction of which amino acids will produce strong
specific binding is beyond the capabilities of current
theory. Given the important role of localized waters or
ions in the TrpR-DNA interface (OTWI88) and in the 434R-DNA
interface (AGGA88), such predictions are likely to remain
beyond reach for some time.

The Mnt repressor of P22 is an 82 residue protein
that binds as a tetramer to an approximately palindromic 17
base pair operator, presumably in a manner that is two-fold
rotationally symmetric. Although the Mnt protein is 40%
alpha helical and has some homology to λ Cro protein, Mnt
is known to contact operator DNA by N-terminal residues
(VERS87a) and possibly by a residue (K79) close to the C
terminus (KNIG88). It is unlikely, therefore, that an
H-T-H structure in Mnt mediates DNA binding (VERS87a).
Another residue (Y78) close to the C-terminal end has been
found to stabilize tetramer formation (KNIG88). Though the
three dimensional structure of Mnt is not known, DNA-
binding experiments have indicated that the Mnt operator,
in B-form conformation, is contacted at major groove
nucleotides on both front and back sides of the operator
helix (VERS87a).

The Arc repressor of P22 is a 53 residue protein that
binds as a dimer to a partially palindromic 21 base pair
operator adjacent to the mnt operator in P22 and protects a
region of the operator that is only partially symmetric
relative to the symmetric sequences in the operator
(VERS87b). Arc is 40% homologous to the N-terminal portion
of Mnt, and the N-terminal residues of the Arc protein
contact operator DNA such that an H-T-H binding motif is
unlikely, as in Mnt binding (VERSSSb). The three dimen-
sional structure of Arc, like Mnt, is not known, but a
crystallographic study is in progress (JORD85). DNA-
binding experiments have shown that Arc probably binds
along one face of B-form operator DNA. These experiments
indicate that Arc contacts operator phosphates farther out
from the center of operator symmetry than do the repressors
or Cro proteins of λ or 434, or P22 Mnt protein. Thus the
researchers state that the operator DNA may be bent around
Arc in binding or Arc dimer may have an extended structure
to allow such contacts to occur (VERS87b). These alterna-
tives are not mutually exclusive.

DNA-Binding Proteins Other Than Repressor Proteins

Any protein (or polypeptide) which binds DNA may be
used as an initial DNA-binding protein; the present method
is not limited to repressor proteins, but rather includes
other regulatory proteins as well as DNA-binding enzymes
such as polymerases and nucleases.

Derivatives of restriction enzymes may be used as
initial DBPs. All known restriction enzymes recognize
eight or fewer base pairs and cut genomic DNA at many
places. Expression of a functional restriction enzyme at
high levels is lethal unless the corresponding sequence-
specific DNA-modifying enzyme is also expressed. EcoR I
that lacks residues 1-29, denoted EcoR I-delN29, has no
nuclease activity (JENJ86); EcoR I-delN29 binds sequence-
specifically to DNA that includes the EcoR I recognition
sequence, GAATTC (BECK88).
From the structure of R.EcoR I (MCCL86), we can see
that extension of the polypeptide chain at either the amino
or carboxy terminus would allow contacts with base pairs
outside of the canonical hexanucleotide.

Specifically, extending EcoRI(AT139), EcoRI(GS140), or
EcoRI(RQ203) (YANO87) by, for example, ten highly
variegated residues at the amino terminus and selecting for
binding to a target such as TGAATTCA or GGAATTCC allows
isolation of a protein having novel DNA-recognition
properties. Alternatively, EcoR I may be extended at the
amino terminus by addition of a zinc-finger domain. It may
be useful to have two or more tandem repeats of the
octanucleotide target placed in or near the promoter region
of the selectable gene. Fox (FOXK88) has used DNase-I to
footprint EcoR I bound to DNA and reports that 15 bp are
protected. Thus, repeated octanucleotide targets for
proteins derived from EcoR I should be separated by eight or
more base pairs; one could place one copy of the target
upstream of the -35 region and one copy downstream of the
-10 region. There are many residues in EcoR I that contact
the DNA as the enzyme wraps around it. These residues
could be varied to alter the binding of the protein. To
obtain acceptable specificity, we may need to pick as
initial DBP a mutant of EcoR I that folds and dimerizes, but
that binds DNA weakly. The mutations in regions of the
protein that contact DNA outside of the original GAATTC
will confer the desired affinity and specificity on the
novel protein.

One may wish to obtain a protein that binds to one
target DNA sequence, but not to other sequences that
contain a subsequence of the target. For example, we may
seek a protein that recognizes TGAATTCA, but not any of the
sequences vGAATTCb. To achieve this distinction, we place
the target sequence in the promoter region of the selec-
table gene and one or more instances of the related
sequences, to which we intend that the protein not bind, in
the promoter region of an essential gene, such as an
antibiotic-resistance gene.
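
The set of related sequences denoted vGAATTCb can be enumerated
mechanically. The following is a minimal sketch, assuming that the
lower-case letters v and b are the standard IUPAC ambiguity codes
(v = A, C, or G; b = C, G, or T); the function name expand is
hypothetical and is used only for illustration.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
# Assumption: the "v" and "b" in the text follow this convention.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(pattern):
    """Return every concrete DNA sequence matched by an IUPAC pattern."""
    choices = [IUPAC[symbol.upper()] for symbol in pattern]
    return ["".join(bases) for bases in product(*choices)]


target = "TGAATTCA"              # sequence the novel protein should bind
related = expand("vGAATTCb")     # sequences the protein should not bind

for sequence in related:
    print(sequence)
```

Under this reading, every related sequence keeps the GAATTC core but
differs from the target at both flanking positions, so none of them
coincides with TGAATTCA.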

Other stable proteins may also be used as initial
DBPs, even if they show no DNA-binding properties.
Parraga et al. (Reference 8 in PARR88) report that Eisen et
al. have fused 229 residues of yeast ADR1 to beta-galac-
tosidase and that the fusion protein binds sequence-
specifically to DNA in vitro.

Adenovirus E1A protein turns on early viral genes as
well as the human heat shock protein hsp70 (SIMO88).
Further, a normal inducible nuclear DNA-binding protein
regulates the interleukin-2 receptor alpha (IL-2R alpha)
gene and also promotes activation of transcription from the
HIV-1 virus LTR (BOHN88). These studies indicate one of
the many difficulties of designing antiviral chemotherapy
by using the transcriptional regulatory apparatus of the
virus as a target. This invention uses unique target
sequences, not represented elsewhere in the host genome, as
targets for suppression of gene expression.

The DNA sequences of operators that interact with
proteins that control mating-type and cell-type specific
transcription in yeast (MILL85) reveal that the consensus
site for action of the alpha2 protein dimer is symmetric,
while a heterodimeric complex of alpha2 and a1 subunits
acts on an asymmetric site. The alpha2-a1-responsive site
consists of a half-site that is identical to the alpha2
half-site, and another half-site that is a consensus for a1
protein binding. The spacings between the symmetric and
asymmetric sites are not the same.

Antibodies that bind DNA and other nucleic acids have
been obtained from human patients suffering from Systemic
Lupus Erythematosus. Murine monoclonal antibodies have
been obtained that specifically recognize Z-DNA, B-DNA,
ssDNA, triplex DNA, and certain repeating sequences
(ANDE88). Anderson et al. (ANDE88) report that: 1) the
antibodies studied contact six base pairs and four phos-
phates, 2) antibodies are unlikely to provide some of the
well known motifs for DNA-binding, e.g. helix-turn-helix,
3) study of DNA-antibody complexes may yield insights into
mechanisms of recognition, and 4) a DNA-recognizing
antibody might be converted into a sequence- or structure-
specific nuclease. The shortness of the contact makes it
unlikely that high specificity can be attained.

Properties of serially-linked globular domains:

A protein motif for DNA binding, present in some
eukaryotic transcription factors, is the zinc finger in
which zinc coordinately binds cysteine and histidine
residues to form a conserved structure that is able to
bind DNA (FRAN88). Xenopus laevis transcription factor
TFIIIA is the first protein demonstrated to use this motif
for DNA binding, but other proteins such as human tran-
scription factor SP1, yeast transcription activation factor
GAL4, and estrogen receptor protein have been shown to
require zinc for DNA binding in vitro (EVAN88). Other mam-
malian and avian steroid hormone receptors and the adeno-
virus E1A protein, which bind DNA at specific sites, contain
cysteine-rich regions which may form metal chelating loops.

Zinc-finger regions have been observed in the sequen-
ces of a number of eukaryotic DBPs, but no high-resolution
3D structure of a Zn-finger protein is yet available. A
variety of models have been proposed for the binding of
zinc-finger proteins to DNA (FAIR86, PARR88, BERG88,
GIBS88). Model building suggests which residues in the
Zn-fingers contact the DNA, and these would provide the
primary set of residues for variation. Berg (BERG88) and
Gibson et al. (GIBS88) have presented models having many
similarities but also some significant differences. Both
models suggest that the motif comprises an antiparallel
beta structure followed by an alpha helix and that the
front side of the helix contacts the major groove of the
DNA. By assuming that conserved basic residues of the
Zn-finger make contact with phosphate groups in each copy
of the motif, Gibson et al. deduce that the amino terminal
part of the helix makes direct contact to the DNA. The
Gibson model does not, however, account well for the number
of bases contacted by Zn-finger proteins. The observations
on H-T-H proteins suggest that a DNA-recognizing element
can interact in a variety of ways with DNA, and we assert
that a similar situation is likely in Zn-finger proteins.
Thus, until a 3D model of a Zn-finger protein bound to DNA
is available, all of the residues modeled as occurring on
the alpha helix away from the beta structure should be
considered as primary candidates for variegation when one
wishes to alter the DNA-binding properties of a Zn-finger
protein. In addition, residues in the beta segment may
control interactions with the sugar-phosphate backbone,
which can affect both specific and non-specific binding.

Parraga et al. (PARR88) have reported a low-resolution
structure of a single zinc-finger from NMR data. They
confirm the alpha helix proposed by Berg and by Gibson et
al., but not the antiparallel beta sheet. The models
proposed by Klug and colleagues (FAIR86) have a common
feature that is at variance with the models of Berg and of
Gibson et al., viz. that the protein chain exits each
finger domain at the same end that it entered. The
structure published by Parraga et al. does not settle this
point, but suggests that the exit strand tends toward the
end opposite from the entrance strand, thereby supporting
the overall models of Berg and of Gibson et al. Parraga et
al. also report that a) a chimeric molecule consisting of
zinc-finger domains linked to beta-galactosidase binds
sequence-specifically to DNA and b) a protein comprising
only two finger motifs can bind sequence-specifically to
DNA. They do not suggest that the residues could be
mutagenized to achieve novel recognition.
A protein composed of a series of zinc fingers offers
the greatest potential of uniquely recognizing a single
site in a large genome. A series of zinc fingers is not so
well suited to development of a DBP that is sensitive to an
effector molecule as is a more compact globular protein
such as E. coli CRP. Positive control of genes adjacent to
the target DNA subsequence can be achieved as in the case
of TF-IIIA.

Overview: Variegation Strategy

Choice of residues in parental potential-DBP to vary:

We choose residues in the initial potential-DBP to
vary through consideration of several factors, including:
a) the 3D structure of the initial DBP, b) sequences
homologous to the initial DBP, c) modeling of the initial
DBP and mutants of the initial DBP, d) models of the 3D
structure of the target DNA, and e) models of the complex
of the initial DBP with DNA. Residues may be varied for
several reasons, including: a) to establish novel recogni-
tion by changing the residues involved directly in DNA
contacts while keeping the protein structure approximately
constant, b) to adjust the positions of the residues that
contact DNA by altering the protein structure while keeping
the DNA-contacting residues constant, c) to produce hetero-
dimeric DBPs by altering residues in the dimerization
interface while keeping DNA-contacting residues constant,
and d) to produce pseudo-dimeric DBPs (see below) by
varying the residues that join segments of dimeric DBPs
while keeping the DNA-contacting residues and other
residues fixed.

If a dimeric protein comprises two identical polypep-
tide chains related by a two-fold axis of rotation, we
speak of a homodimer with two-fold dyad symmetry. When two
very similar polypeptides fold into similar domains and
associate, we may observe that there is an approximate two-
fold rotational axis that relates homologous residues, such
as the alpha1-beta1 dimer of haemoglobin. We refer to such
a protein as a heterodimer and to the symmetry axis as a
quasi-dyad. When we produce a single-chain DBP by fusing
gene fragments that encode two DNA-binding domains joined
by a linker amino acid subsequence, we call the molecule a
pseudo-dimer and the axis that relates pairs of residues a
pseudo-dyad.

Principles that guide choice of residues to vary:

A key concept is that only structured proteins
exhibit specific binding, i.e. can bind to a particular
chemical entity to the exclusion of most others. In the
case of polypeptides, the structure may require stabiliza-
tion in a complex with DNA. The residues to be varied are
chosen to preserve the underlying initial DBP structure or
to enhance the likelihood of favorable polypeptide-DNA
interactions. The selection process eliminates cells
carrying genes with mutations that prevent the DBP from
folding. Genes that code for proteins or polypeptides that
bind indiscriminately are eliminated since cells carrying
such proteins are not viable. Although preservation of the
basic underlying initial DBP structure is intended, small
changes in the geometry of the structure can be tolerated.
For example, the spatial relationship between the alpha 3
helix in one monomer of λ Cro and the alpha 3 helix in the
dyad-related monomer (denoted alpha 3') is a candidate for
variation. Small changes in the dimerization interface
can lead to changes of up to several Å in the relative
positions of residues in alpha 3 and alpha 3'.

Burial of hydrophobic surfaces so that bulk water is
excluded is one of the strongest forces driving the
folding of macromolecules and the binding of proteins to
other molecules. Bulk water can be excluded from the
region between two molecules or between two portions of a
single molecule only if the surfaces are complementary.
The double helix of B-DNA allows most of the hydrophobic
surface nucleotides to be buried. The edges of the bases
have several hydrogen-bonding groups; the methyl group of
thymine is an important hydrophobic group in DNA (HARR88).
To achieve tight binding, the shape of the protein must be
highly complementary to the DNA, all or almost all hydro-
gen-bonding groups on both the DNA and the protein must
make hydrogen bonds, and charged groups must contact
either groups of opposite charge or groups of suitable
polarity or polarizability.

There are two complementary interfaces of major
interest: a) the DNA-protein interface and b) the interface
between protein monomers of dimers or between domains of
pseudo-dimers. The DNA-protein interface is more polar
than most protein-protein interfaces, but hydrophobic amino
acids (e.g., F, L, M, V, I, W, Y) occur in sequence-specific
DNA-protein interfaces. The protein-protein interfaces of
natural DBPs are typical protein-protein interfaces.

Amino acids are classified as hydrophilic or hydro-
phobic (ROSE85, EISE86a,b), and although this classifica-
tion is helpful in analyzing primary protein structures, it
ignores that the side groups may contain both hydrophobic
and hydrophilic portions, e.g., lysine. Hydrogen bonds and
other ionic interactions have strong directional behavior,
while hydrophobic interactions are not directional. Thus
substitution of one hydrophobic side group for another
hydrophobic side group of similar size in an interface is
frequently tolerated and causes subtle changes in the
interface. For the purposes of the present invention, such
hydrophobic-interchange substitutions are made in the
protein-protein interface of DBPs so that a) the geometry
of the two monomers in the dimer will change, and b)
compensating interactions produce exclusively heterodimers.

The process claimed here tests as many surfaces as
possible to select one as efficiently as possible that
binds to the target. The selection isolates cells produc-
ing those proteins that are more nearly complementary to
the target DNA, or proteins in which intermolecular or
intramolecular interfaces are more nearly complementary to
each other so that the protein can fold into a structure
that can bind DNA. The effective diversity of a variegated
population is measured by the number of different surfaces,
rather than the number of protein sequences. Thus we
should maximize the number of surfaces generated in our
population, rather than the number of protein sequences.
Proteins do not have distinct, countable surfaces; there-
fore, we define an interaction set as a collection of
residues of a protein that can simultaneously touch the
target DNA.

If N spatially separated residues of a protein are
varied, 20 x N surfaces are generated. Variation of N
residues in the same interaction set yields 20^N surfaces.
For example, if N = 6, variation of spatially separated
residues yields 120 surfaces while variation of interacting
residues yields 20^6 = 6.4 x 10^7 surfaces. The process of
varying residues in an interaction set to maximize the
number of surfaces obtained is referred to as Structure-
directed Mutagenesis.
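
The arithmetic in the preceding paragraph can be checked with a short
calculation. This is an illustrative sketch only; the function names
are hypothetical, and the value 20 is the number of amino acids
allowed at each varied position, as in the text.

```python
AMINO_ACIDS = 20  # number of amino acids allowed at each varied residue


def surfaces_separated(n, k=AMINO_ACIDS):
    """Surfaces from n spatially separated residues: each residue
    changes its own local surface independently, giving k * n."""
    return k * n


def surfaces_interacting(n, k=AMINO_ACIDS):
    """Surfaces from n residues in one interaction set: every
    combination of residues is a distinct surface, giving k ** n."""
    return k ** n


n = 6
print(surfaces_separated(n))    # 120
print(surfaces_interacting(n))  # 64000000, i.e. 6.4 x 10^7
```

The gap between the two counts grows exponentially with N, which is
why varying residues within a single interaction set is preferred.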

If the protein residues to be varied are close enough
together in sequence that the variegated DNA (vgDNA)
encoding all of them can be made in one piece, then
cassette mutagenesis is picked. The present invention is
not limited to a particular length of vgDNA that can be
synthesized. With current technology, a stretch of 60
amino acids (180 DNA bases) can be spanned.
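
The choice between a single cassette and other methods can be
sketched as a simple span check. The 60-residue limit and the factor
of three bases per residue come from the text; the function names and
the example residue positions are hypothetical and serve only as an
illustration.

```python
CODON_LENGTH = 3            # DNA bases per amino acid residue
MAX_CASSETTE_RESIDUES = 60  # span quoted in the text for current technology


def cassette_span(positions):
    """Number of residues from the first to the last varied position,
    counted inclusively."""
    return max(positions) - min(positions) + 1


def vgdna_length(positions):
    """Length in bases of the vgDNA needed to cover all varied positions."""
    return cassette_span(positions) * CODON_LENGTH


def fits_one_cassette(positions, limit=MAX_CASSETTE_RESIDUES):
    """True if all varied residues can be covered by one piece of vgDNA."""
    return cassette_span(positions) <= limit


close_together = [26, 28, 29, 33, 35]   # hypothetical residue numbers
far_apart = [4, 83]                     # hypothetical residue numbers

print(fits_one_cassette(close_together), vgdna_length(close_together))
print(fits_one_cassette(far_apart), vgdna_length(far_apart))
```

Residue sets that fail the check would call for the alternative
methods described in the paragraphs that follow.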

Mutation of residues further than sixty residues
apart can be achieved using other methods, such as single-
stranded-oligonucleotide-directed mutagenesis (BOTS85) and
two or more mutating primers.
To vary residues separated by more than sixty resi-
dues, two cassettes may be mutated serially. From 2-fold
to 1000-fold variegation is first introduced into a first
cassette. We then introduce 1000-fold to 10^-fold varie-
gation into a second cassette of the variegated vector
population. The composite level of variation preferably
does not exceed the prevailing capabilities to a) produce
very large numbers of independently transformed cells or b)
select small components in a highly varied population.
The limits on the level of variegation are discussed
below.

Assembly of Relevant Data: 

Here we assemble the data about the initial DBP and
the target that are useful in deciding which residues to
vary in the variegation cycle:

1) 3D structure, or at least a list of residues that
   contact DNA and that are involved in the dimer contact
   of the initial DBP,

2) list of sequences homologous to the initial DBP, and

3) model of the target DNA sequence.

These data and an understanding of the function and
structure of different amino acids in proteins will be
used to answer three questions:

1) which residues of the initial DBP are on the outside
   and close enough together in space to touch the target
   DNA simultaneously?

2) which residues of the initial DBP can be varied with
   high probability of retaining the underlying initial
   DBP structure?

3) which residues of the initial DBP can affect the
   dimerization or folding of the initial DBP?

Although an atomic model of the target material is
preferred in such examination, it is not necessary.

Graphical and computational tools:

The most appropriate method of picking the residues of
the protein chain at which the amino acids should be varied
is by viewing with interactive computer graphics a model of
the initial DBP complexed with operator DNA. A model based
on X-ray data from the DNA-protein complex is preferred,
but other models may be used. A stick-figure representa-
tion of molecules is preferred. Suitable programs for
viewing and manipulating protein and nucleic acid models
include: a) PS-FRODO, written by T. A. Jones (JONE85) and
distributed by the Biochemistry Department of Rice Univer-
sity, Houston, TX; and b) PROTEUS, developed by Dayringer,
Tramontano, and Fletterick (DAYR86). Any hardware that
supports either of these programs is appropriate.

Use of Knowledge of Mutations Affecting Protein Stability

In choosing the residues to vary and the substitu-
tions to be made for such residues, one may make use not
only of modelling as described above but also of experi-
mental data concerning the effects of mutation in the
initial DNA-binding protein. Mutations which will markedly
reduce protein stability are to be avoided in most cases.

Missense mutations that decrease DNA-binding protein
function non-specifically by affecting protein folding are
distinguished from binding-specific mutations primarily on
the basis of protein stability (NELS83, PAKU86, VERSBSb,
HECH84, HECH85a, and HECH85b).
Tables 1, 12, and 13 summarize the results of a
number of studies on single missense mutations in the
three bacteriophage repressor proteins: λ repressor
(Table 12) (NELS83, GUAR82, HECH85a, and NELS85), λ Cro
(Table 1) (PAKU86, EISE85), and P22 Arc repressor (Table
13) (VERS86a, VERS86b). The majority of the mutant
sequences shown in Tables 1, 12, and 13 were obtained in
experiments designed to detect loss of function in vivo.
The second-site pseudo-reversion mutations (HECH85a) and
suppressed nonsense mutations (NELS83) restore function,
and some of the site-specific changes (EISE85) produce
functional proteins.

Roughly 50-70% of the single missense mutations of
the DNA-binding proteins selected for loss of function
(Tables 1, 12, and 13) produce protein folding defects.

Use of Knowledge of Mutations Affecting the DNA-Protein Interface

Missense mutations in residues thought to be involved
in specific interactions with DNA have been reported for
several prokaryotic repressor proteins. Table 14 shows an
alignment of the H-T-H DNA-binding domains of four prokary-
otic repressor proteins (from top to bottom: λ repressor,
λ Cro, 434 repressor and trp repressor) and indicates the
positions of missense mutations in residues that are
solvent-exposed in the free protein but become buried in
the protein-DNA complex, and that affect DNA binding.

Randomly obtained missense mutations in solvent-
exposed residues of λ repressor, λ Cro, and trp repressor
yield sets of mutants that reduce DNA binding (Table 14).
These sets correlate well to the sets of residues that are
proposed to interact directly with DNA. Some mutations in
λ Cro (EISE85) and all those shown for 434 repressor
(WHAR85a) were obtained through site-directed mutagenesis.
Most of the mutations shown in the λ and trp repressor
sequences are trans-dominant when the mutant gene is
present on an overproducing plasmid (NELS83, KELL85). The
exceptions to trans-dominance are the λ repressor SP35 and
the trp repressor AT80 mutations. This latter change
produces a repressor that has only slightly reduced binding
(KELL85). The trans-dominance observed for these mutations
is proposed by the authors to result from the wild-type
repressor and the mutant repressor forming mixed oligomers
which are inactive in binding to operator sites.

Wharton (WHAR85a) has reported that extensive site-
directed mutagenesis of 434 repressor positions 28 and 29
produced no functional protein sequences other than the
wild-type. Apparently, in the context of 434 repressor
structure and operators, only proteins with the wild-type
Q28-Q29 sequence bind to the wild-type operators.

Table 14 also shows missense mutations that result in
near normal repressor activity. Substitution of 434
repressor Q33 with H, L, V, T, or A produces repressors
that function if expressed from overproducing plasmids
(WHAR85a); repressor specificity is, however, reduced.
Mutations in λ repressor, QY33 (NELS83, HECH83), and in λ
Cro, YF26 (EISE85), produce altered proteins which make
one less H-bond to the DNA and which bind to the operator
DNA with reduced affinity. Thus, loss of a single H-bond
is insufficient to completely abolish binding of DNA.
Mutations YK26 and HR35 in λ Cro show nearly normal binding
(EISE85).

Nelson and Sauer (NELS85) and Hecht et al. (HECH85a,
b) have described four replacements in λ repressor (Table
12): EK34, GN48, GS48, and EK83. These derivatives have
higher affinity for OR1 than wild-type λ repressor.

Extended amino acid arms at N- and C-terminal loca-
tions are important DNA-binding structures in at least four
prokaryotic repressors: λ repressor and Cro, and P22 Arc
and Mnt.

Sequence-specific and sequence-independent contacts
are made by the first 6 amino acid residues (STKKKP) of the
λ repressor N-terminal region, which form an "arm" that can
wrap around the DNA (ELIA85, PABO82a). Missense mutations
KE4 and LF12 (Table 12) both greatly reduce repressor
activity in vivo (NELS83). Deletion of the first six
residues results in a protein which is non-functional in
vivo (ELIA85). Deletion of the first three residues
results in decrease of affinity for OR1, loss of protection
of back side guanines, altered specificity between OR1 and
OR3, and decreased binding sensitivity to changes in
temperature or salt concentration (ELIA85, PABO82a).

Missense mutations of P22 Arc that produce non-func-
tional proteins with high intracellular specific protein
levels (Table 13) are found only in the N-terminal 10
residues of the protein (VERS86b). A single residue change
at position 6 (HP6) in P22 Mnt changes operator recognition
in the altered protein (YOUD83, VERS86a,b). Knight and
Sauer (cited in VERS86a,b) replaced the first 6 residues of
Mnt repressor with the first 9 residues of Arc repressor to
produce a repressor that binds to the arc operator but not
to the mnt operator. Thus P22 Mnt and Arc use a recogni-
tion region located in the first 6-10 amino-terminal
residues for DNA recognition and binding. The N-terminal
DNA-binding region of these proteins cannot be the
recognition helix of a typical H-T-H motif.

In λ Cro, a C-terminal sequence (K62-K63-T64-T65-A66)
has been suggested on the basis of model building (TAKE85)
and NMR measurements (LEIG87) to form a flexible arm that
interacts with minor groove phosphates. Eisenbeis and
Caruthers (cited in KNIG88) have found that T64, T65, and
A66 have minor effects on protein-operator affinity, while
K63 is very important. The C-terminal sequence of P22 Mnt
(K79-K80-T81-T82) is almost identical to that of λ Cro.
It has been shown (KNIG88) that deletion of the three
residues after K79 has little effect on protein structure
or DNA binding. Deletion of K79 and the distal residues,
however, reduces operator binding by three orders of
magnitude with little apparent change in protein structure.

Use of Knowledge of Mutations Affecting the Protein-Protein Interface

It is also possible to modulate DNA-binding specificity
by altering the protein-protein interface. Because
the oligomerization equilibrium is coupled to DNA binding,
mutations that alter oligomerization affect operator site
affinity. Since oligomerization involves the matching of
protein surfaces, many interactions are hydrophobic and
mutations which specifically destabilize oligomerization
are similar to mutations which destabilize global protein
structure. Interactions at the site of oligomerization can
influence the strength of interactions at the DNA-binding
site by subtle alterations in protein structure.

Use of Mutations That Affect Activation 

When λ, 434, and P22 repressors bind to their respective
O_R2 sites, they activate transcription (POTE80,
POTE82, PTAS80). The site on λ repressor which activates
RNA polymerase is located on the N-terminal domain of the
molecule (BUSH88, HOCH83, SAUE79). Activation requires
contact between the N-terminal domain of repressor at O_R2
and RNA polymerase (HOCH83, SAUE79), and this contact
stimulates isomerization of the polymerase complex to the
open form (McClure and Hawley, cited in GUAR82).

Missense mutations in λ, P22, or 434 repressors that
specifically reduce P_RM activation while leaving operator
binding intact are in the solvent-exposed protein surface
closest to RNA polymerase bound at P_RM (GUAR82, PABO79,
BUSH88, WHAR85a). For λ and 434 repressor this surface
includes residues in alpha helix 2 and in the turn between
alpha helices 2 and 3. In P22 repressor, the surface is
formed at the carboxyl terminus of alpha helix 3 (PABO79,
TAKE83). In each repressor, the changes that reduce
transcriptional activation at P_RM involve the substitution
of a basic residue for a neutral or acidic residue. Further,
missense mutations in λ and 434 repressors which increase
transcription at P_RM involve the substitution of an acidic
residue for a neutral or basic residue (GUAR82, BUSH88).

Transcriptional activation at P_RM involves the
apposition of a negatively charged surface on the N-terminal
domain of λ, 434, or P22 repressor to a site on
RNA polymerase (BUSH88). Mutations that a) alter the
negatively charged surface of repressor by removing acidic
residues or by replacing them with basic residues, or b)
position the negative surface incorrectly with respect
to RNA polymerase, decrease transcriptional activation at
P_RM. Alterations that produce a more negatively charged
surface act to increase transcription at P_RM.

Pick principal set of residues to vary: 

A huge number of variant DNA sequences can be generated
by synthesis with mixed reagents at chosen bases.
Usually, it is necessary that the number of variants not
exceed the number of independently transformed cells
generated from the synthetic DNA. It is efficient,
however, to make the number of variants as close as
practical to this limit. The total number of variants is
the product of the number of variants at each varied codon
over all the variable codons. Thus, we first consider
which residues could be varied with an expectation that
alteration could affect DNA binding. We then pick a range
of amino acids at each variable residue. The total number
of variants is the product of these numbers. If the
product is too large or too small, we alter the list of
residues and range of variation at each variable residue
until an acceptable number is found.
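
The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration: the function names and the example figure of 10^7 independent transformants are assumptions made here, not values taken from the text.

```python
from math import prod

def total_variants(choices_per_codon):
    """Product of the number of amino-acid choices at each varied codon."""
    return prod(choices_per_codon)

def fits_library(choices_per_codon, n_transformants):
    """True when the variant count does not exceed the number of
    independently transformed cells (n_transformants is user-supplied)."""
    return total_variants(choices_per_codon) <= n_transformants

# Hypothetical plan: five residues, each varied through all 20 amino acids.
plan = [20] * 5
print(total_variants(plan))               # 3200000
print(fits_library(plan, 10**7))          # True
print(fits_library([20] * 7, 10**7))      # False: 20^7 exceeds 10^7
```

If the product is too large, one would shorten the list or narrow the range at some residues and recompute, as the paragraph describes.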

Considering which residues are on the surface of the
initial DBP, we pick residues that are close enough
together on the surface of the initial DBP to touch a
molecule of the target simultaneously without having any
initial DBP main-chain atom come closer than van der Waals
distance (viz. 4.0 to 5.0 Å center to center) to any target
atom. For the purposes of the present invention, a residue
of the initial DBP "touches" the target if:

a) a main-chain atom is within van der Waals distance,
   viz. 4.0 to 5.0 Å, of any atom of the target molecule,
b) the C_beta is within a specific distance of any atom of
   the target molecule so that a side-group atom could
   make contact with that atom, or
c) there is evidence that altering the residue alters the
   DNA-binding of the initial DBP.

The residues in the principal set need not be contiguous
in the protein sequence. The exposed surfaces of the
residues to be varied need not be connected. We prefer
only that the amino acids in the residues to be varied all
be capable of touching a single copy of the target DNA
sequence simultaneously without atoms overlapping.

In addition to the geometrical criteria, we prefer
that there be indications that the initial DBP structure
will tolerate substitutions at each residue in the principal
set of residues. Indications could come from various
sources, including homologous sequences and modeling.

Pick a secondary set of residues to vary:

The secondary set comprises those residues not in the
primary set that touch residues in the primary set. These
residues might be excluded from the primary set because the
residue is: a) internal, b) highly conserved, or c) on the
surface, but the curvature of the initial DBP surface
prevents the residue from being in contact with the target
at the same time as one or more residues in the primary
set.

Internal residues are frequently conserved and the
amino acid type can not be changed to a significantly
different type without risk that the protein structure will
be disrupted. Nevertheless, some conservative changes of
internal residues, such as I to L or F to Y, are tolerated.
Such conservative changes affect the detailed placement and
dynamics of adjacent protein residues and such variation
may be useful to improve the characteristics of DBP
binding.

Surface residues in the secondary set are most often
located on the periphery of the principal set. Such
peripheral residues can not make direct contact with the
target simultaneously with all the other residues of the
principal set. It is appropriate to vary the charge of
some or all of these residues. For example, the variegated
codon containing equimolar A and G at base 1, equimolar C
and A at base 2, and A at base 3 yields amino acids T, A,
K, and E with equal probability.
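
The example codon can be checked by enumerating it against the standard genetic code. The sketch below is illustrative only; the table layout (bases in the order T, C, A, G) and the helper name `encoded` are choices made here, not part of the text.

```python
from itertools import product
from collections import Counter

# Standard genetic code, indexed base1/base2/base3 in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def encoded(base1, base2, base3):
    """Count the amino acids encoded when each codon position is an
    equimolar mixture of the listed bases."""
    return Counter(CODE[a + b + c] for a, b, c in product(base1, base2, base3))

# A/G at base 1, C/A at base 2, A at base 3:
print(sorted(encoded("AG", "CA", "A").items()))
# [('A', 1), ('E', 1), ('K', 1), ('T', 1)]
```

Each of the four codons (ACA, AAA, GCA, GAA) encodes a different amino acid, so the four are equally likely, and two of them (K and E) carry opposite charges.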

Choice of residues to vary simultaneously:

The allowed level of variegation determines how many
residues can be varied at once; geometry determines which
ones. The user may pick residues to vary in many ways;
the following is a preferred manner. The user picks the
objective of the variegation, vide supra.

The number of residues picked is coupled to the range
through which each can be varied. In the first round,
progressivity is not an issue; the user may elect to
produce a level of variegation such that each molecule of
vgDNA is potentially different through, for example,
unlimited variegation of 10 codons (20^10 = approx. 10^13
different protein sequences). The levels of efficiency of
ligation and transformation reduce the number of DNA 
5 sequences actually tested to between lo'^ and 10^. Multiple 
performances of the process with very high levels of 
variegation will not yield repeatable results; the user 
decides whether this is important. 

Pick range of variation:

Each varied residue can have a different scheme of
variegation, producing 2 to 20 different possibilities. We
require that the process be progressive, i.e. each variegation
cycle produces a better starting point for the next
variegation cycle than the previous cycle produced.

N.B.: Setting the level of variegation such that
      the parental pdbp and many sequences related to
      the parental pdbp sequence are present in
      detectable amounts ensures that the process is
      progressive. If the level of variegation is so
      high that the parental pdbp sequence is too rare
      to be detected as a transformant,
      then each round of mutagenesis is independent of
      previous rounds and there is no assurance of
      progressivity. This approach can lead to
      valuable DNA-binding proteins, but multiple
      repetitions of the process at this level of
      variegation will not yield progressive results.
      Excessive variegation is not preferred in
      subsequent iterations of this process.

Progressivity is not an all-or-nothing property. So
long as most of the information obtained from previous
variegation cycles is retained and many different surfaces
that are related to the parental DBP surface are produced,
the process is progressive. If the level of variegation is
so high that the parental dbp gene may not be detected, the
assurance of progressivity diminishes. If the probability
of recovering the parental DBP is negligible, then the
probability of progressive results is also negligible.

An opposing force in our design considerations is
that DBPs are useful in the population only up to the
amount that can be detected; any excess above the detectable
amount is wasted. Thus we produce as many surfaces
related to the parental DBP as possible within the constraint
that the parental DBP be present as a marker for
the detection level.

Mutagenesis of DNA:

We now decide how to distribute the variegation
within the codons for the residues to be varied. These
decisions are influenced by the nature of the genetic
code. When vgDNA is synthesized, variation at the first
base of a codon creates a population coding for amino acids
from the same column of the genetic code table (Table 16);
variation at the second base of the codon creates a
population coding for amino acids from the same row of the
genetic code table; variation at the third base of the
codon creates a population coding for amino acids from the
same box. Work with 3D protein structural models may
suggest definite sets of amino acids to substitute at a
given residue, but the method of variation may require
that either more or fewer kinds of amino acids be included. For
example, substitution of N or Q at a given residue may be
wanted. Combinatorial variation of codons requires that
mixing N and Q at one location also include K and H as
possibilities at the same residue. The present invention
does not rely on accurate predictions of the amino acids to
be placed at each residue; rather, attention is focused on
which residues should be varied.
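
The N/Q example follows directly from the standard genetic code, as the following sketch shows. The helper `mixture_products` is a name introduced here for illustration; it mixes, at each codon position, every base used by any of the wanted codons and lists everything the resulting population encodes.

```python
from itertools import product

# Standard genetic code, indexed base1/base2/base3 in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def mixture_products(wanted_codons):
    """Amino acids encoded when each position is a mixture of the bases
    that appear at that position in any of the wanted codons."""
    positions = [sorted({codon[i] for codon in wanted_codons}) for i in range(3)]
    return sorted({CODE[a + b + c] for a, b, c in product(*positions)})

# One codon for N (AAT or AAC) mixed with one codon for Q (CAA or CAG):
for n_codon in ("AAT", "AAC"):
    for q_codon in ("CAA", "CAG"):
        print(n_codon, q_codon, mixture_products([n_codon, q_codon]))
# every combination prints ['H', 'K', 'N', 'Q']
```

Whichever codons are chosen for N and Q, the first and third bases must both be mixed, and the cross terms encode K and H.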



There are many ways to generate diversity in a
protein (RICH86, CARU85, OLIP86). An extreme case is that
one or a few residues of the protein are varied as much as
possible (inter alia, see CARU85, CARU87, RICH86, WHAR85a).
We will call this limit "Focused Mutagenesis". When there
is no binding between the parental DBP and the target, we
preferably pick a set of five to seven residues on the
surface and vary each through all 20 possibilities.

An alternative plan of mutagenesis ("Diffuse Mutagenesis")
that may be useful is to vary many more residues
through a more limited set of choices (VERS86a,b, INOU86
(Ch. 15), PAKUSS). This can be accomplished by spiking
each of the pure nucleotides activated for DNA synthesis
(e.g. nucleotide-phosphoramidites) with one or more of the
other activated nucleotides. Contrary to general practice,
the present invention sets the level of spiking so that
only a small percentage (1% to .00001%, for example) of
the final product will contain the parental DNA sequence.
This will ensure that the majority of molecules carry
single, double, triple, and higher mutations and, as
required for progressivity, that recovery of the parental
sequence will be a possible outcome.

Let Nb be the number of bases to be varied, and let Q
be the fraction of all DNA sequences that should have the
parental sequence; then M, the fraction of the nucleotide
mixture that is the majority component, is

   M = exp( log_e(Q)/Nb ) = 10^( log_10(Q)/Nb ).

If, for example, thirty base pairs on the DNA chain were to
be varied and 1% of the product is to have the parental
sequence, then each mixed nucleotide substrate should
contain 86% of the parental nucleotide and 14% of other
nucleotides. Table 17 shows the fraction (fn) of DNA
molecules having n non-parental bases when 30 bases are
synthesized with reagents that contain fraction M of the
majority component. When M='. 63096, f24 and higher are less 
than 10"®. Note that substantial probability for 8 or more
substitutions occurs only if the fraction of parental 
sequence (fO) drops to around 10"^. 
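
The relation between Q, Nb, and M, and the fractions fn, can be reproduced with a short calculation. The sketch below assumes that each varied base is incorporated independently, so that the number of non-parental bases follows a binomial distribution; that assumption, and the function names, are introduced here for illustration and are not stated in the text.

```python
from math import comb

def majority_fraction(q, nb):
    """M such that M**nb == q: the majority-component fraction needed so
    that a fraction q of molecules keeps the parental base at all nb sites."""
    return q ** (1.0 / nb)

def fraction_with_n_changes(m, nb, n):
    """fn: fraction of molecules with exactly n non-parental bases,
    assuming independent incorporation at each of the nb varied bases."""
    return comb(nb, n) * (1.0 - m) ** n * m ** (nb - n)

m = majority_fraction(0.01, 30)
print(round(m, 3))                                   # 0.858, i.e. about 86%
for n in range(0, 9):
    print(n, fraction_with_n_changes(m, 30, n))
```

Here f0 equals Q by construction, and the fn sum to 1 over n = 0 to Nb.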

The Nb base pairs of the DNA chain that are synthesized
with mixed reagents need not be contiguous. They are
picked so that between Nb/3 and Nb codons are affected to
various degrees. The residues picked for mutation are
picked with reference to the 3D structure of the initial
DBP, if known. For example, one might pick all or most of
the residues in the principal and secondary sets. We may
impose restrictions on the extent of variation at each of
these residues based on homologous sequences or other data.
The mixture of non-parental nucleotides need not be random;
rather, mixtures can be biased to give particular amino acid
types specific probabilities of appearance at each codon.
For example, one residue may contain a hydrophobic amino
acid in all known homologous sequences; in such a case, the
first and third base of that codon would be varied, but the
second would be set to T. This Diffuse Mutagenesis will
reveal the subtle changes possible in the protein backbone
associated with conservative interior changes, such as V to
I, as well as some not so subtle changes that require
concomitant changes at two or more residues of the protein.

Focused Mutagenesis:

If we have no information indicating that a particular
amino acid or class of amino acid is appropriate, we
approximate substitution of all amino acids with equal
probability because representation of one or a few pdbp
genes above the detectable level is unproductive. Equal
amounts of all four nucleotides at each position in a codon
yield the amino acid distribution:
   4/64 A   2/64 C   2/64 D   2/64 E   2/64 F   4/64 G
   2/64 H   3/64 I   2/64 K   6/64 L   1/64 M   2/64 N
   4/64 P   2/64 Q   6/64 R   6/64 S   4/64 T   4/64 V
   1/64 W   2/64 Y   3/64 Stop



This distribution has the disadvantage of giving two basic
residues for every acidic residue. Such predominance of
basic residues is likely to promote sequence-independent
DNA binding. In addition, R, S, and L each occur six times
as often as W or M in the random distribution. Use of equimolar
C and G at the third base reduces the over-representation
of S, R, and L, but does not cure the maldistribution of
acidics and basics.
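
These counts can be verified by tallying the standard genetic code, once with all four bases at every position and once with the third base restricted to C and G. The helper name `codon_counts` is introduced here for illustration.

```python
from itertools import product
from collections import Counter

# Standard genetic code, indexed base1/base2/base3 in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def codon_counts(third_bases):
    """Codons per amino acid with equimolar bases 1 and 2 and the given
    equimolar set of third bases."""
    return Counter(CODE[a + b + c] for a, b, c in product(BASES, BASES, third_bases))

all_four = codon_counts("TCAG")   # 64 codons
c_or_g = codon_counts("CG")       # 32 codons

print(all_four["K"] + all_four["R"], all_four["D"] + all_four["E"])   # 8 4
print(all_four["R"], all_four["S"], all_four["L"], all_four["W"], all_four["M"])  # 6 6 6 1 1
print(c_or_g["R"], c_or_g["S"], c_or_g["L"], c_or_g["W"], c_or_g["M"])  # 3 3 3 1 1
print(c_or_g["K"] + c_or_g["R"], c_or_g["D"] + c_or_g["E"])           # 4 2
```

Restricting the third base halves the excess of R, S, and L over W and M (from 6:1 to 3:1), while the ratio of basic (K, R) to acidic (D, E) codons remains 2:1.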

Consider the distribution of amino acids encoded by
one codon in a population of vgDNA. Let Abun(x) be the
abundance of DNA sequences coding for amino acid x. For
any distribution, there will be a most-favored amino acid
(mfaa) with abundance Abun(mfaa) and a least-favored amino
acid (lfaa) with abundance Abun(lfaa). We seek the
nucleotide distribution that allows all twenty amino acids
and that yields the largest ratio Abun(lfaa)/Abun(mfaa)
subject to two constraints. First, the abundances of
acidic and basic amino acids should be equal. Second, the
number of stop codons should be kept as low as possible.
Thus only nucleotide distributions that yield

   Abun(E) + Abun(D) = Abun(R) + Abun(K)

are considered, and the function maximized is:

   f(distribution) = { (1 - Abun(stop)) (Abun(lfaa)/Abun(mfaa)) }.

We limit the third base to equimolar T and G (C and G
would be equivalent). All amino acids are possible and the
number of accessible stop codons is reduced.
A computer program, "Find Optimum vgCodon" (Table
18), varies the composition at bases 1 and 2, in steps of
0.05, and reports the composition that gives the largest
value of f(distribution) subject to the constraints:

   g2 = (g1*a2 - 0.5*a1*a2)/(c1 + 0.5*a1),
   t1 = 1 - a1 - c1 - g1, and
   t2 = 1 - a2 - c2 - g2.

The first constraint requires equal amounts of acidic and
basic amino acids and the second and third conserve
matter.

We vary a1, c1, g1, a2, and c2 and then calculate t1, g2,
and t2. Initially, variation is in steps of 5%. Once an
approximately optimum distribution of nucleotides is
determined, the region is further explored with steps of
1%. The optimum distribution is:

Optimum vgCodon

             T      C      A      G
base #1 =  0.26   0.18   0.26   0.30
base #2 =  0.22   0.16   0.40   0.22
base #3 =  0.5    0.0    0.0    0.5

and yields DNA molecules encoding each type of amino acid
with the abundances shown in Table 19.
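
The quantities that enter f(distribution) can be computed directly from a nucleotide distribution. The sketch below is not the program of Table 18; it is a minimal illustration, with names chosen here, that evaluates the "Optimum vgCodon" composition against the standard genetic code.

```python
# Standard genetic code, indexed base1/base2/base3 in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def abundances(dist):
    """dist is a list of three dicts (one per codon position) mapping each
    base to its fraction; returns the abundance of each amino acid and stop."""
    out = {}
    for codon, aa in CODE.items():
        p = dist[0][codon[0]] * dist[1][codon[1]] * dist[2][codon[2]]
        out[aa] = out.get(aa, 0.0) + p
    return out

def objective(dist):
    """f(distribution) = (1 - Abun(stop)) * Abun(lfaa) / Abun(mfaa)."""
    ab = abundances(dist)
    stop = ab.pop("*")
    return (1.0 - stop) * min(ab.values()) / max(ab.values())

OPTIMUM = [
    {"T": 0.26, "C": 0.18, "A": 0.26, "G": 0.30},
    {"T": 0.22, "C": 0.16, "A": 0.40, "G": 0.22},
    {"T": 0.5, "C": 0.0, "A": 0.0, "G": 0.5},
]

ab = abundances(OPTIMUM)
print("acidic:", ab["D"] + ab["E"], "basic:", ab["K"] + ab["R"])
print("stop:", ab["*"], "f:", objective(OPTIMUM))
```

With this composition the acidic and basic abundances agree to within the rounding of g2 to two decimals, and only the TAG stop codon remains accessible. A search over a1, c1, g1, a2, and c2 in fixed steps, with g2, t1, and t2 derived from the constraints, would call `objective` for each candidate.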

The actual nucleotide distribution obtained in
synthetic DNA will differ from the specified nucleotide
distribution due to several causes, including: a) differential
inherent reactivity of nucleotide substrates, and b)
differential deterioration of reagents. It is possible to
compensate partially for these effects, but some residual
error will occur. We denote the average discrepancy
between specified and observed nucleotide fraction as S_err:

   S_err = square root( average[ ((f_obs - f_spec)/f_spec)^2 ] )

where f_obs is the amount of one type of nucleotide found at
a base and f_spec is the amount of that type of nucleotide
that was specified at the same base. The average is over
all specified types of nucleotides and over a number (e.g.
10 to 50) of different variegated bases. By hypothesis,
the actual nucleotide distribution at a variegated base
will be within 5% of the specified distribution. Actual
DNA synthesizers and DNA synthetic chemistry may have
different error levels. It is the user's responsibility to
determine S_err for the DNA synthesizer and chemistry
employed by the user.
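
A user who has measured nucleotide fractions could estimate S_err along the following lines. This sketch reads the definition above as a root-mean-square relative error, which is an assumption made here because the printed formula is partly illegible; the function name and the sample numbers are likewise illustrative.

```python
from math import sqrt

def s_err(observed, specified):
    """Root-mean-square relative discrepancy between observed and specified
    nucleotide fractions (assumed reading of the definition). Only
    nucleotides that were actually specified (fraction > 0) are averaged."""
    squares = [((o - s) / s) ** 2
               for o, s in zip(observed, specified) if s > 0]
    return sqrt(sum(squares) / len(squares))

# Hypothetical measurement at one variegated base (order T, C, A, G).
specified = [0.26, 0.18, 0.26, 0.30]
observed = [0.251, 0.189, 0.273, 0.287]
print(s_err(observed, specified))
```

In practice the lists would pool every specified nucleotide over the 10 to 50 variegated bases that were assayed.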

To determine the possible effects of errors in
nucleotide composition on the amino acid distribution, we
modified the program "Find Optimum vgCodon" in four ways:

1) the fraction of each nucleotide in the first two bases
   is allowed to vary from its optimum value times (1 -
   S_err) to the optimum value times (1 + S_err) in seven
   equal steps (S_err is the hypothetical fractional error
   level), maintaining the sum of nucleotide fractions
   for one codon position at 1.0,

2) g2 is varied in the same manner as a2, i.e. we dropped
   the restriction that Abun(D) + Abun(E) = Abun(K) +
   Abun(R),

3) t3 and g3 are varied from 0.5 times (1 - S_err) to 0.5
   times (1 + S_err) in three equal steps,

4) the smallest ratio Abun(lfaa)/Abun(mfaa) is sought.

In actual experiments, we direct the synthesizer to
produce the optimum DNA distribution "Optimum vgCodon"
given above. Incomplete control over DNA chemistry may,
however, cause us to actually obtain the following distribution,
which is the worst that can be obtained if all
nucleotide fractions are within 5% of the amounts specified
in "Optimum vgCodon". A corresponding table can be
calculated for any given S_err using the program "Find worst
vgCodon within S_err of given distribution" given in Table
20.

Optimum vgCodon, worst 5% errors

             T      C      A      G
base #1 =  0.251  0.189  0.273  0.287
base #2 =  0.209  0.160  0.400  0.231
base #3 =  0.475  0.0    0.0    0.525

This distribution yields DNA encoding each of the twenty
amino acids at the abundances shown in Table 21.

Each codon synthesized with the distribution of bases
shown above displays 4 x 4 x 2 = 2^5 = 32 possible DNA
sequences, though not in equal abundances. An oligonucleotide
containing N such codons would display 2^(5N) possible
DNA sequences and would encode 20^N protein sequences.
Other variegation schemes produce different numbers of DNA
and protein sequences. For example, if two bases in one
codon are varied through two possibilities each, then there
are 2 x 2 = 4 DNA sequences and 2 x 2 = 4 protein sequences.

If five codons are synthesized with reagents mixed so
as to produce the nucleotide distribution "Optimum vgCodon",
and if we actually obtained the nucleotide distribution
"Optimum vgCodon, worst 5% errors", then DNA
sequences encoding the mfaa at all of the five codons are
about 277 times as likely as DNA sequences encoding the
lfaa at all of the five codons. Further, about 24% of the
DNA sequences will have a stop codon in one or more of the
five codons.
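
Both figures can be recomputed from the "worst 5% errors" composition (T, C, A, G = 0.251, 0.189, 0.273, 0.287 at base 1; 0.209, 0.160, 0.400, 0.231 at base 2; 0.475, 0, 0, 0.525 at base 3). The following is a minimal sketch with names chosen here, using the standard genetic code.

```python
# Standard genetic code, indexed base1/base2/base3 in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

WORST = [
    {"T": 0.251, "C": 0.189, "A": 0.273, "G": 0.287},
    {"T": 0.209, "C": 0.160, "A": 0.400, "G": 0.231},
    {"T": 0.475, "C": 0.0, "A": 0.0, "G": 0.525},
]

def abundances(dist):
    """Abundance of each amino acid (and '*' for stop) for one codon."""
    out = {}
    for codon, aa in CODE.items():
        p = dist[0][codon[0]] * dist[1][codon[1]] * dist[2][codon[2]]
        out[aa] = out.get(aa, 0.0) + p
    return out

ab = abundances(WORST)
stop = ab.pop("*")
mfaa = max(ab, key=ab.get)   # most-favored amino acid
lfaa = min(ab, key=ab.get)   # least-favored amino acid

n_codons = 5
ratio = (ab[mfaa] / ab[lfaa]) ** n_codons
with_stop = 1.0 - (1.0 - stop) ** n_codons
print(mfaa, lfaa, ratio, with_stop)
```

Under this composition the most-favored amino acid is R and the least-favored is F; the five-codon ratio comes out near 277 and the fraction of sequences carrying at least one stop near 24%.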
Consider variegation of a hypothetical sequence, F24-
G25-D26-E27-T28, in which each variegated codon is synthesized
as an "Optimum vgCodon". The actual abundance of the
DNA encoding each type of amino acid is, however, taken
from the case of S_err = 5% given in Table 21. The abundance
of DNA encoding the parental amino acid sequence is:

Amount(parental seq.)
      F24       G25       D26       E27       T28
  = Abun(F) * Abun(G) * Abun(D) * Abun(E) * Abun(T)
  = .0249   x .0663   x .0545   x .0602   x .0437
  = 2.4 x 10^-7

Therefore, if the efficiency of the entire process allows 
us to examine 10^ different DNA sequences, DNA encoding the 
parental DBP sequence as well as very many related sequences
will be present in sufficient quantity to be detected
and we are assured that the process will be progressive.



Setting level of variegation:

We use the following procedure to determine whether a
given level of variegation is practical:

1) from: a) the intended nucleotide distribution at each
   base of a variegated codon, and b) S_err (the error
   level in mixed DNA synthesis), calculate the abundances
   of DNA sequences coding for each amino acid and
   stop,

2) calculate the abundance of DNA encoding the parental
   DBP sequence by multiplying the abundances of the
   parental amino acid at each variegated residue.

The abundances used in the procedure above are calculated
from the worst distribution that is within S_err of the
specified distribution. A variegation that ensures that
the parental DBP sequence can be recovered is practical.
Such a level of variegation produces an enormous number of
multiple changes related to the parental DBP available for
selection of improved successful DBPs. We adjust the
subset of residues to be varied and levels of variegation
at each residue until the calculated variegation is within
bounds.
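
Step 2 of the procedure, and the test of whether a variegation is within bounds, can be sketched as follows. The abundances are the five worst-case values quoted for the hypothetical F-G-D-E-T example; the library sizes and the threshold of one expected copy (`min_copies`) are assumptions introduced here for illustration.

```python
from math import prod

# Worst-case abundances quoted for the hypothetical example (S_err = 5%).
WORST_ABUNDANCE = {"F": 0.0249, "G": 0.0663, "D": 0.0545, "E": 0.0602, "T": 0.0437}

def parental_abundance(parental_residues, abundance):
    """Multiply the abundance of the parental amino acid at each
    variegated residue."""
    return prod(abundance[aa] for aa in parental_residues)

def is_practical(parental_residues, abundance, n_examined, min_copies=1.0):
    """True if the expected number of parental sequences among n_examined
    DNA sequences reaches min_copies (an assumed detection threshold)."""
    return parental_abundance(parental_residues, abundance) * n_examined >= min_copies

p = parental_abundance("FGDET", WORST_ABUNDANCE)
print(p)                                                # about 2.4e-07
print(is_practical("FGDET", WORST_ABUNDANCE, 10**7))    # True
print(is_practical("FGDET", WORST_ABUNDANCE, 10**6))    # False
```

When the check fails, one would remove residues from the varied subset or narrow the variegation at some residues and repeat the calculation.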

Reduction of gratuitous restriction sites: 

If the method of mutagenesis to be used is replacement
of a cassette, we consider whether the variegation generates
gratuitous restriction sites. We reduce or eliminate
gratuitous restriction sites by appropriate choice of
variegation pattern and silent alteration of codons
neighboring the sites of variegation.
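
One way to screen a variegated cassette for gratuitous sites is to ask whether any member of the population could contain a given recognition sequence. The sketch below is illustrative: the cassette, the use of IUPAC letters to describe mixed positions, and the function name are assumptions made here, and only the written strand is scanned (sufficient for palindromic sites such as HindIII, AAGCTT).

```python
# IUPAC codes for the mixed positions used in this example.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "N": "ACGT", "K": "GT", "S": "CG"}

def possible_site_starts(cassette, site):
    """Return every start index at which some member of the variegated
    population could carry the recognition sequence `site`."""
    allowed = [set(IUPAC[ch]) for ch in cassette]
    hits = []
    for start in range(len(allowed) - len(site) + 1):
        if all(site[i] in allowed[start + i] for i in range(len(site))):
            hits.append(start)
    return hits

# Hypothetical cassette: Lys codon AAG, one variegated codon NNK, then GCA.
print(possible_site_starts("AAGNNKGCA", "AAGCTT"))   # [0]

# Silent change of the neighboring Lys codon from AAG to AAA removes the risk.
print(possible_site_starts("AAANNKGCA", "AAGCTT"))   # []
```

For non-palindromic recognition sequences the reverse complement of the site would have to be scanned as well.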

Focused mutagenesis: 

In the preferred embodiment of this process, the
number of residues and the range of variation at each
residue are chosen to maximize the number of DNA-binding
surfaces, to minimize gratuitous restriction sites, and to
assure the recovery of the initial DBP sequence. For
example, in Detailed Example 1, the initial DBP is λ Cro.
One primary set of residues includes G15, Q16, K21, Y26,
Q27, S28, N31, K32, H35, A36, and R38 of the H-T-H region
(Table 14b) and C-terminal residues K56, N61, K62, K63,
T64, T65, and A66. A secondary set of residues includes
L23, G24, and V25 from the turn portion of the H-T-H
region, buried residues T20, A21, A30, I31, A34, and I35
from alpha helices 2 and 3, and dimerization region
residues E54, V55, F58, P59, and S60.

The initial set of 5 residues for Focused Mutagenesis
contains residues in or near the N-terminal half of alpha
helix 3: Y26, Q27, S28, K31, and K32. Varying these 5
residues through all 20 amino acids produces 3.2 x 10^6
different protein sequences encoded by 32^5 (= 3.3 x 10^7)
different DNA sequences. Since all 5 residues are in the
same interaction set, this variegation scheme produces the
maximum number of different surfaces. Assuming optimized
nucleotide distribution described above and Serr " 
5 probability of obtaining the parental sequence is 3,2 x 
10""^. This level is within boxinds f or synthesis, ligation, 
transformation, and selection capable of examining 10^ 
sequences of vgDNA. Codons for the 5 residues picked for 
Focused Mutagenesis are contained in the 51 bp PpuMI I to 
10 Bal ll fragment of the rav' *' gene constructed in Detailed 
Example 1. 

Repetition to obtain desired degree of DNA-binding:

The first variegation step can produce one or more
DBPs having DNA-binding properties that are satisfactory to
the user. If the best selected DBP is not fully satisfactory,
parental DBPs for a second variegation step are
picked from DBPs isolated in the first variegation step.
The second and subsequent variegation steps may employ
either Focused or Diffuse Mutagenesis procedures on
residues of the primary or secondary sets. In the preferred
embodiment of this process, the user chooses residues
and mutagenesis procedures based on the structure of the
parental DBP and specific goals. For example, consider
three hypothetical cases.

In a first case, a variegation step produces a DBP
with greater non-specific DNA binding than is desired.
Information from sequence analysis and modeling is used to
identify residues involved in sequence-independent interactions
of the DBP with DNA in the non-specific complex.
In the next variegation step, some or all of these residues,
together with one or more additional residues from
the primary set, are chosen for Focused Mutagenesis and
additional residues from the primary or secondary sets are
chosen for Diffuse Mutagenesis.
In a second hypothetical case, a variegation step
produces a DBP with strong sequence-specific binding to
the target and the goal is to optimize binding. In this
case, the next variegation step employs Diffuse Mutagenesis
of a large number of residues chosen mostly from the
secondary set.

In the third hypothetical case, a DBP has been
isolated that has insufficient binding properties. A set
of residues is chosen to include some primary residues that
have not been subjected to variation, one or more primary
residues that have been varied previously, and one or more
secondary residues. Focused Mutagenesis is performed on
this set in the next variegation step.

Overview: DNA Synthesis, Purification, and Cloning

DNA sequence design:

The present invention is not limited to a single
method of gene design. The idbp gene need not be synthesized
in toto; parts of the gene may be obtained from
nature. One may use any genetic engineering method to
produce the correct gene fusion, so long as one can easily
and accurately direct mutations to specific sites. In all
of the methods of mutagenesis considered in the present
invention, however, it is necessary that the DNA sequence
for the idbp gene be unique compared to other DNA in the
operative cloning vector. If the method of mutagenesis is
to be replacement of subsequences coding for the potential-DBP
with vgDNA, then the subsequences to be mutagenized
must be bounded by restriction sites that are unique with
respect to the rest of the vector. If single-stranded
oligonucleotide-directed mutagenesis is to be used, then
the DNA sequence of the subsequence coding for the initial
DBP must be unique with respect to the rest of the vector.

The coding portions of genes to be synthesized are
designed at the protein level and then encoded in DNA.
The amino acid sequences are chosen to achieve various
goals, including: a) expression of initial DBP intracellularly,
and b) generation of a population of potential-DBPs
from which to select a successful DBP. The ambiguity
in the genetic code is exploited to allow optimal placement
of restriction sites and to create various distributions of
amino acids at variegated codons.

Organization of gene synthesis:

The present invention is not limited as to how a
designed DNA sequence is divided for easy synthesis. An
established method is to synthesize both strands of the
entire gene in overlapping segments of 20 to 50 nucleotides
(THER88). An alternative method that is more suitable for
synthesis of vgDNA is similar to methods published by
others (OLIP86, OLIP87, AUSU87, KARN84). Contrary to most
previous workers, we: a) use two synthetic strands, and b)
do not cut the extended DNA in the middle. Our goals are:
a) to produce longer pieces of dsDNA than can be synthesized
as ssDNA on commercial DNA synthesizers, and b)
to produce strands complementary to single-stranded vgDNA.
By using two synthetic strands, we remove the requirement
for a palindromic sequence at the 3' end. Moreover, the
overlap should not be palindromic lest single DNA molecules
prime themselves.
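
Whether a proposed overlap is palindromic, in the sense that it equals its own reverse complement and could therefore pair with a second copy of the same strand, is easy to check. The sequences below are hypothetical and serve only to illustrate the test.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def is_self_complementary(overlap):
    """True if the overlap equals its own reverse complement, so that two
    copies of the same strand could anneal through it."""
    return overlap == reverse_complement(overlap)

print(is_self_complementary("GAATTC"))     # True: unsuitable as an overlap
print(is_self_complementary("GATTACAG"))   # False
```

A fuller screen might also look for shorter self-complementary stretches at the 3' end of each strand; that refinement is not shown here.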

The present invention is not limited to any particular
method of DNA synthesis or construction. Preferably, DNA
is synthesized on a Milligen 7500 DNA synthesizer (Milligen,
Bedford, MA) by standard procedures. Synthetic DNA
is purified by polyacrylamide gel electrophoresis (PAGE) or
high-pressure liquid chromatography (HPLC). The present
invention is not limited to any particular method of
purifying DNA for genetic engineering.

IDBP gene cloning:

We clone the idbp gene using plasmids that are trans-
formed into competent bacterial cells by standard methods
(MANI82) or slightly modified standard methods. DNA
fragments derived from nature are operably linked to other
fragments of DNA.

Cells transformed with the plasmid bearing the
complete idbp gene are tested to verify expression of the
initial DBP. Selection for plasmid presence is maintained
on all media, while selections for DBP+ phenotypes are
applied only after growth in the presence of inducer
appropriate to the promoter. Colonies that display the
DBP+ phenotypes in the presence of inducer and DBP-
phenotypes in the absence of inducer are retained for
further genetic and biochemical characterization. The
presence of the idbp gene is initially detected by restric-
tion enzyme digestion patterns characteristic of that gene
and is confirmed by sequencing.

The dependence of the IDBP+ and IDBP- phenotypes on
the presence of this gene is demonstrated by additional
genetic constructions. These are a) excision of the idbp
gene by restriction digestion and closure by ligation, and
b) ligation of the excised idbp gene into a plasmid
recipient carrying different markers and no dbp gene.
Plasmids obtained by excising the gene confer the DBP-
phenotypes (e.g. Tc^R, Fus^S, and Gal^S in Detailed Example
1). Plasmids obtained from ligation of idbp to a recipient
plasmid confer the DBP+ phenotypes in the presence of an
inducer appropriate to the regulatable promoter (e.g. Tc^S,
Fus^R, and Gal^R in Detailed Example 1). Finally, a most
important demonstration of the successful construction
involves determination of the quantitative dependence of
the selected phenotypes on the exogenous inducer concentra-
tion.

Overview: DNA-binding Protein Purification and
Characterization

Isolation of IDBP:

We purify IDBP and its derivatives by standard
methods, such as those described in JOHN80, TAKE86, LEIG87,
VERS85b, KADO86.

Quantitation and characterization of protein-DNA binding:

Methods that can be used to quantitate and char-
acterize sequence-specific and sequence-independent binding
of a DBP to DNA include: a) filter-binding assays, b)
electrophoretic mobility shift analysis, and c) DNase
protection experiments. Ionic strength, pH, and tempera-
ture are important factors influencing DBP binding to DNA.
Standard conditions should correspond closely to the
anticipated conditions of use. Thus, if a binding protein
is intended for use in bacterial cells in standard culture,
a reasonable range of values from which to choose standard
conditions would be: pH 7.5 to 8.0, 0.1 to 0.2 M KCl, and
32° to 37°C. Assay buffers preferably include cofactors,
stabilizing agents, and counterions for proper DBP
function.

We prepare DNA fragments for analysis of protein-DNA
binding by methods that are very similar to those described
in MAXA77, KLEN70, RIGB77, and KIMJ87. Filter-binding
assays can yield thermodynamic (K_D) and kinetic (k_a and
k_d) constants and are performed by methods similar to those
described by RIGG70 and KIMJ87. Electrophoretic mobility
shift measurements can also yield values of K_D, k_a, and k_d
and are performed by methods similar to those of FRIE81.
DNase protection assays use the methods of JOHN79, MAXA77,
and FOXK88. We use chemical methods to characterize binding
of proteins to DNA similar to the methods described in
BRUN87, BUSH85, and JENJ86.
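
As a rough illustration of how a dissociation constant
relates to filter-binding data, the sketch below assumes a
simple single-site equilibrium with protein in large excess
over labeled DNA. That model, the function name, and the
example concentrations are assumptions made for illustration
and are not taken from the cited methods.

```python
def fraction_bound(protein_molar: float, kd_molar: float) -> float:
    """Fraction of DNA bound at equilibrium for simple 1:1 binding.

    Assumes free protein is approximately equal to total protein
    (protein in large excess over DNA): f = [P] / ([P] + K_D).
    """
    if protein_molar < 0 or kd_molar <= 0:
        raise ValueError("concentrations must be non-negative and K_D positive")
    return protein_molar / (protein_molar + kd_molar)


# Hypothetical titration: K_D is the protein concentration at half-saturation.
assumed_kd = 1e-9
for conc in (1e-10, 1e-9, 1e-8):
    print(conc, round(fraction_bound(conc, assumed_kd), 3))
```

Under this model the protein concentration giving 50%
retention of labeled DNA on the filter estimates K_D.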

Table of Examples

Ex. 1 Protocol for developing a new DNA-binding protein
      with affinity for a DNA sequence found in HIV-1,
      by variegation of λ Cro.

Ex. 2 Protocol for developing a new DNA-binding
      polypeptide with affinity for a DNA sequence
      found in HIV-1, by variegation of a polypeptide
      having a segment homologous with Phage P22 Arc.

Ex. 3 Use of a custodial domain (residues 20-83 of
      barley chymotrypsin inhibitor) to protect a DNA-
      binding polypeptide from degradation.

Ex. 4 Use of a custodial domain containing a DNA-
      recognizing element (alpha-3 helix of Cro) to
      protect a DNA-binding polypeptide from degrada-
      tion.

Ex. 5 Protocol for addition of an arm to Phage P22 Arc to
      alter its DNA-binding characteristics.

Ex. 6 Protocol for preparation of a novel DNA-binding
      protein that recognizes an asymmetric DNA
      sequence and corresponds to a fusion of the third
      zinc-finger domain of the Drosophila kr gene
      product and the DNA-binding domain of Phage P22
      Arc.



* * * * *

DETAILED EXAMPLE 1

Below is a hypothetical example of a protocol for
developing a new DNA-binding protein derived from λ Cro
with affinity for a DNA sequence found in human immunodefi-
ciency virus type 1 (HIV-1) using E. coli K-12 as the cell
line or strain. Further optimization, in accordance with
the teachings herein, may be necessary to obtain the
desired results. Possible modifications in the preferred
method are discussed following various steps of the
example.

By hypothesis, we set the following technical capabil-
ities:

Yield from DNA synthesis
     500 ng/synthesis of ssDNA 100 bases long,
     10 ug/synthesis of ssDNA 60 bases long,
     1 mg/synthesis of ssDNA 20 bases long.

Maximum oligonucleotide
     100 bases

Yield of plasmid DNA
     1 mg/l of culture medium

Efficiency of DNA ligation
     0.1 % for blunt-blunt,
     4 % for sticky-blunt,
     11 % for sticky-sticky.

Yield of transformants
     5 X 10^ / ug DNA

Error in mixed DNA synthesis (Serr)
     5%



Choice of cell line or strain: 

In this example, the following E. coli K-12 recA
strains are used: ATCC #35,882 delta4 (Genotype: W3110
trpC, recA, rpsL, sup°, delta4 (gal-chlD-pgl-att lambda))
and ATCC #33,694 HB101 (Genotype: F-, leuB, proA, recA,
thi, ara, lacY, galK, xyl, mtl, rpsL, supE, hsdS, (rB-, mB-)).
E. coli K-12 strains are grown at 37°C in LB broth
(MANI82, p440) and on LB agar (addition of 15 g Bacto-agar)
for routine purposes. Selections for plasmid uptake and
maintenance are performed with addition of ampicillin (Ap)
(200 ug/ml), tetracycline (Tc) (12.5 ug/ml) and kanamycin
(Km) (50 ug/ml).

Choice of initial DBP:

The initial DBP is λ Cro. Helix-turn-helix proteins
are preferred over other known DBPs because more detail is
known about the interactions of these proteins with DNA
than is known for other classes of natural DBP. λ Cro is
preferred over λ repressor because it has lower molecular
weight. Cro from 434 is smaller than λ Cro, but more is
known about the genetics and 3D structure of λ Cro. An X-
ray structure of the λ Cro protein has been published, but
no X-ray structure of a DNA-Cro complex has appeared. A
mutant of λ Cro, Cro67, confers the positive control
phenotype in vitro but not in vivo. The contacts that
stabilize the Cro dimer are known, and several mutations in
the dimerization function have been identified (PAKU86).

By the methods disclosed herein, DBPs may be developed
from Cro which recognize DNA binding sites different from
the λ OR3 or λ operator consensus binding sites, including
heterodimeric DBPs which recognize non-symmetric DNA
binding sites.

30 

Selections for phenotypes conferred by DBP+ function:

Media generally are supplemented with IPTG and
antibiotic for selection of plasmid maintenance. Cell
background is generally strain delta4 (galK,T,E deletion).

a. Galactose resistance (Gal^R). Galactose epimerase
deficient (galE-) strains of E. coli (BUTT63) lyse when
treated with galactose. Selective medium is supplemented
with 2% galactose, added after autoclaving. Additional
galactose, up to 8%, somewhat reduces the background of
artifactual galactose-insensitive colonies.

5 

b. Galactose resistance selected immediately after
transformation. Inducer IPTG is added to transformed
cells, to 5 x 10^-4 M, at the start of the growth period
that allows expression of plasmid antibiotic-resistance.
At 60 min after heat shock, cells are further diluted 10-
fold into fresh LB broth containing IPTG, antibiotic to
select for plasmid uptake (e.g. Ap or Km), and 2% galac-
tose. Cells are grown until lysis is complete or for 3 h,
whichever occurs first, then centrifuged at 6,000 rpm for
10 min, resuspended in the initial volume of the post-
transformation growth culture, and applied to medium for
further selection.

c. Fusaric acid resistance (Tc^S, Fus^R). Successful
repression of tet yields resistance to lipophilic chelating
agents such as fusaric acid (Fus^R phenotype). Medium
described by MALO81 is used for selection of fusaric acid
resistance in E. coli; the amount of fusaric acid may be
varied. Total cell inoculum is not greater than 5 x 10^
per plate.

d. Fusaric acid resistance and galactose resistance.
Galactose at a final concentration of 2% is added to the
medium described by MALO81 after autoclaving. Cells
selected directly for galactose resistance in liquid
following transformation are applied to this medium.

Selections for phenotypes conferred by DBP- function:

Cell background is generally strain HB101 (galK-).
Media are generally supplemented with IPTG and antibiotic
for plasmid maintenance.

a. Tc resistance. Medium, usually LB agar, is supple-
mented with Tc after autoclaving. Tc stock solution is
12.5 mg/ml in ethanol. It is stored at -20°C, wrapped in
aluminum foil. Petri plates containing Tc are also wrapped
in foil. Minimum inhibitory concentration is 3.1 ug/ml
using a cell inoculum of 5 x 10^ to 10^ per plate. More
stringent selections employ up to 50 ug/ml Tc. When used
for selection of plasmid maintenance, Tc concentration is
12.5 ug/ml.

10 

b. Galactose utilization. Minimal A Medium (MILL72,
p432), with galactose as carbon source: after autoclaving
add (per liter) 1 ml of 1 M MgSO4, 0.5 ml of 10 mg/ml thiamine
HCl, 10 ml of 20% galactose, and amino acids as required.
Cell inoculum per plate is less than 5 x 10^.

c. Tc resistance and galactose utilization. Medium A with
galactose (section b. above) is supplemented with Tc at
3.1 ug/ml.

20 

Selectable systems for DBP isolation:

The tet gene from pBR322 and the E. coli galT,K genes
are used in a gal deletion host strain for selection of DBP
function. pKK175-6 (BROS84; Pharmacia, Piscataway, NJ), a
pBR322 derivative, contains the replication origin, bla
(confers Ap^R) for selection of plasmid maintenance, and
tet, one of the two selectable genes (Figure 3). In
pKK175-6, tet is promoterless, and all sequences upstream of
the pBR322 tet coding region that potentially allow transcrip-
tion in both directions (BROS82) have been deleted and
replaced by the M13 mp8 polylinker. The polylinker and tet
are flanked by strong transcription terminators from E.
coli rrnB. tet is placed under control of the Tn5 neo
promoter, Pneo.

Plasmid pAA3H (Figure 4) (ATCC #37,308) (AHME84)
provides the second set of selectable genes, galT,K. In
gal deleted hosts (such as strain ATCC #35,882 carrying the
delta4 deletion (E. coli delta4)) plasmid pAA3H confers
the Ap^R Tc^S Gal^S phenotype (AHME84) because part of galE
is deleted. The galT and galK genes in pAA3H are tran-
scribed from the "antitet" promoter (BROS82). In E.
coli strains carrying galT or galK mutations (e.g. strain
HB101), pAA3H confers Gal+. We place galT+ and galK+ under
control of the pBR322 amp gene promoter.

For both tet and gal systems, positive selections are
used to select cells that either express or do not express
these genes from cultures containing a vast excess of cells
of the opposite phenotype.

Placement of test DNA binding sequence: 

The test DNA binding sequence for the IDBP, λ OR3
(KIMJ87), is placed so that the first 5' base is the +1
base of the mRNA transcribed in each of the tet and gal
transcription units (Table 100 and Table 101).

Engineering the idbp gene:

A DNA sequence encoding the wild-type Cro protein is
designed such that expression is controlled by the lacUV5
promoter. The DNA sequence departs from the wild-type cro
gene sequence by the introduction of restriction sites.
Thus, the gene is called rav. The transcriptional unit
comprising PlacUV5, rav, and trpA terminator is shown in
Table 102.

Vector construction:

The construction of an operative cloning vector is
summarized in Figure 5. The gal region of pAA3H requires
manipulation before insertion into pKK175-6. First the λ-
derived DNA between HpaI and EcoRI is replaced with a ClaI
linker (New England BioLabs, #1037). Standard methods are
used and the resulting plasmid is named pEP1001 (Figure 6).
All plasmids cited in the present application are cata-
logued in Table 103.

Next, we insert a synthetic fragment, shown in Table
104, comprising the phage fd terminator and two restriction
sites (SpeI and SfiI) into the ClaI site of pEP1001; the
resulting plasmid is named pEP1002 (Figure 7). Next, we
replace the promoter upstream of gal with Pamp from
pBR322. As shown in Table 100, λ OR3 is positioned
downstream of Pamp so that it can be used to determine
whether binding of Cro can prevent transcription of galT,K.
Restriction sites are provided to allow later alteration of
the target sequence. The synthetic fragment is cloned into
pEP1002 between S^^lll and BamHI. The resulting plasmid is
named pEP1003 and confers Gal^S on delta4 cells.

The gal genes with the promoter and the fd terminator
are moved from pEP1003 into pKK175-6. The 2.69 kb galT,K-
bearing HpaI fragment of pEP1003 is ligated to DNA obtained
from pKK175-6 by partial DraI digestion. Gal+ colonies of
transformed HB101 cells are picked. The resulting plasmid
is named pEP1004 (Figure 9).

The Tn5 neo gene promoter and OR3 are synthesized
(Table 101) and inserted upstream of the tet coding region
of pEP1004 between the unique HindIII and SmaI sites.
Plasmid DNA from Ap^R Tc^R Gal^S colonies of transformed
delta4 cells is analyzed for an insert in the EcoRI-EcoRV
fragment of pEP1004. The resulting 7.1 kb plasmid, with
two separate selectable gene systems under control of two
different promoters and the test DNA binding sequence, is
designated pEP1005 (Figure 10).

Cloning the idbp gene:

The BamHI site in the tet gene of pEP1005 is removed
by site-directed mutagenesis; the sequence
TGG-ATC-CTC that codes for W97-I98-L99 is changed to TGG-
ATA-TTG. DNA from pEP1005 is linearized with EcoRV and
part (ca. 10%) of the DNA is made single stranded with
exonuclease III. The mutagenic oligonucleotide shown in
Table 105 is annealed to the DNA that is then completed
with Klenow enzyme and ligated. Plasmid DNA from Tc^R, Gal+
colonies of transformed HB101 is analyzed by standard
means; the resulting plasmid is named pEP1006.
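
The codon change described above can be checked with a few
lines of code. This is an illustrative sketch only: the
function name and the partial codon table (restricted to the
codons involved) are assumptions, while the two nine-base
sequences and the BamHI recognition site GGATCC are as stated
or standard.

```python
# Partial codon table covering only the codons used in this check.
CODON_TO_AA = {"TGG": "W", "ATC": "I", "ATA": "I", "CTC": "L", "TTG": "L"}
BAMHI_SITE = "GGATCC"


def translate(dna: str) -> str:
    """Translate an in-frame DNA string using the partial table above."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna), 3))


original = "TGGATCCTC"  # TGG-ATC-CTC, codes for W97-I98-L99
mutant = "TGGATATTG"    # TGG-ATA-TTG after mutagenesis

# The encoded peptide is unchanged while the BamHI site is destroyed.
print(translate(original), translate(mutant))
print(BAMHI_SITE in original, BAMHI_SITE in mutant)
```

The same kind of check applies to any silent change made to
remove or introduce a restriction site.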

Synthetic DNA containing a SpeI overhang, followed by
sequences for the lacUV5 promoter, a ribosome binding site,
cloning sites for idbp, the trpA terminator (ROSE79), and
an SfiI restricted end complementary to the SfiI site in
pEP1006 is synthesized as six oligonucleotides as shown in
Table 107. We use the methods of THER88 to anneal and
ligate these fragments into SpeI, SfiI cut pEP1006.
Plasmid DNA from Ap^R, Tc^R, Gal^S colonies of transformed
delta4 cells is examined for the SpeI-SfiI insertion by
restriction with SpeI, BstEII, BglII, KpnI, and SfiI. The
inserted DNA is verified by DNA sequencing, and the 7.22 kb
plasmid containing the proper insertion is designated
pEP1007, shown in Figure 11.

The idbp gene sequence, specifying the Cro+ protein
and designated rav in this Example, is inserted in two
cloning steps. The BstEII-BglII segment of rav (Table 109)
is inserted first. Oligonucleotides olig#14 and olig#15
are synthesized, annealed, and filled in with Klenow
enzyme (cf. KARN84). The dsDNA is cut with BstEII and BglII
and ligated to BstEII-BglII cut pEP1007. The plasmid
containing the appropriate partial rav sequence is desig-
nated pEP1008.

The BglII-KpnI fragment of rav is synthesized and
inserted in the same manner as the BstEII-BglII fragment.
(See Table 110.) This plasmid carrying the complete rav
gene is designated pEP1009, shown in Figure 12.

5 

Determine whether IDBP is expressed:

To determine whether cells carrying pEP1009 display
the phenotypes expected for rav expression, the delta4
strain bearing pEP1009 is tested on various Ap-containing
selective media with and without IPTG. Cells are streaked
on LB agar media containing: a) Tc; b) fusaric acid; or c)
galactose (vide supra). Control strains are the delta4
host with no plasmid, and with pEP1005, pBR322, or pAA3H.

The results below indicate that the rav gene is
expressed and the gene product is functional, and that
expression is regulated by the lacUV5 promoter.

Growth of derivatives of strain delta4
on selective media (+ Ap)

[Table: growth (+ or -) of delta4 derivatives carrying
pBR322, pAA3H, pEP1009, or pEP1005 on media supplemented
with tetracycline, fusaric acid, or galactose, each with
and without IPTG. Individual entries are not legible.]


λ cI- phage is streaked on each of the above strains,
on LB agar with Ap, and with and without IPTG. At suffi-
ciently high intracellular levels of Cro protein, binding
of the Cro repressor protein to the λ phage operators OR
and OL prevents phage growth. Data indicating correct
expression and function of the rav gene are:

Growth of λ cI- on delta4 cells

[Table: phage growth (+ or -) by plasmid, including
pEP1009, with IPTG (+IPTG) and without IPTG (-IPTG).
Individual entries are not legible.]



These procedures indicate that the chosen IDBP, the
product of the rav gene, is expressed and is successfully
repressing both the test operators on the plasmid and the
wild type operators on the challenge phage.

DBP purification:

Proteins are purified as described by Leighton and Lu
(LEIG87).

Quantitation of DBP binding:

We measure DBP binding to the target operator DNA
sequence with a filter binding assay, initially using
filter binding assay conditions similar to those described
for λ Cro (KIMJ87). Data are analyzed by the methods of
RIGG70 and KIMJ87.

The target DNA for the assay is the 113 bp ApaI-RsaI
fragment from plasmid pEP1009 containing λ OR3. A control
DNA fragment of the same size, used to determine non-
specific DNA binding, contains a synthetic ApaI-XbaI DNA
fragment specifying the amp promoter and the sequence

5' CTTATACACGAAGCGTGACAA 3'.

This sequence preserves the base content of the OR3
sequence but lacks several sites of conserved sequence
required for λ Cro binding (KIMJ87) and is cloned between
the ApaI-XbaI sites of the pEP1009 backbone to yield
pEP1010.

Media Formulations:

Gal^S is demonstrable in LB agar and broth at very low
concentrations (0.2% galactose), and is optimal at 2 to 8%
galactose. Galactose and Tc selections are performed in LB
medium. Fus^R is best achieved in the medium described by
Maloy and Nunn (MALO81) for E. coli K-12 strains.

Induction of DBP expression: 

The pdbp gene is regulated by the lacUV5 promoter.
Optimal induction is achieved by addition of IPTG at 5 x
10^-4 M (MAUR80). Experimentation for each successful DBP
determines the lowest concentration that is sufficient to
maintain repression of the selection system genes.

Optimization of selections:

For each selective medium used to detect IDBP function,
factors are varied to obtain a maximal number of transform-
ants per plate with a minimal number of false positive
artifactual colonies. Of greatest importance in this
optimization is the transcriptional regulation of the
initial potential-DBP, such that in further mutagenesis
studies, de novo binding at an intermediate affinity is
compensated by high level production of DBP.

30 

Regulation of IDBP:

Cells carrying pEP1009 are grown in LB broth with IPTG
at 10^-6, 5 x 10^-6, 10^-5, 5 x 10^-5, 10^-4, and 5 x 10^-4 M.
Samples are plated on LB agar and on LB agar containing
fusaric acid or galactose as described above. All media
contain 200 ug/ml Ap, and the IPTG concentrations of the
broth culture media are maintained in the respective
selective agar media.

The IPTG concentration at which 50% of the cells
survive is a measure of affinity between IDBP and test
operator, such that the lower the concentration, the
greater the affinity. A requirement for low IPTG, e.g.
10^-6 M, for 50% survival due to Rav protein function
suggests that use of a high level, e.g. 5 x 10^-4 M IPTG,
employed in selective media to isolate mutants displaying
de novo binding of a DBP to target DNA, will enable
isolation of successful DBPs even if the affinity is low.
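
One way to estimate the IPTG concentration giving 50%
survival from such a dilution series is linear interpolation
on a log-concentration scale. The sketch below is an
assumption about how the estimate might be made; the function
name and the survival fractions are hypothetical and are not
data from this Example.

```python
import math


def conc_at_survival(concs_molar, survival, level=0.5):
    """Interpolate the concentration at which survival crosses `level`.

    concs_molar must be in ascending order. Interpolation is linear
    in log10(concentration). Returns None if the level is never crossed.
    """
    points = list(zip(concs_molar, survival))
    for (c0, s0), (c1, s1) in zip(points, points[1:]):
        if s0 != s1 and (s0 - level) * (s1 - level) <= 0:
            t = (level - s0) / (s1 - s0)
            log_c = math.log10(c0) + t * (math.log10(c1) - math.log10(c0))
            return 10 ** log_c
    return None


# IPTG series from the text; survival fractions are hypothetical.
iptg = [1e-6, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4]
hypothetical_survival = [0.0, 0.1, 0.3, 0.7, 0.9, 1.0]
print(conc_at_survival(iptg, hypothetical_survival))
```

A lower interpolated concentration would indicate, under the
reasoning above, a higher affinity of the DBP for the test
operator.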

Concentration of selective agents and cell inoculum size:

Fusaric acid and galactose content of each medium is
varied, to allow the largest possible cell sample to be
applied per Petri plate. This objective is obtained by
applying samples of large numbers of sensitive cells (e.g.
5 x 10^7, 10^8, 5 x 10^8) to plates with elevated fusaric acid
or galactose. Resistant cells are then used to determine
the efficiency of plating. An acceptable efficiency is 80%
viability for the resistant control strain bearing pEP1009
in a delta4 background. The total cell inoculum size is
increased, as is the level of inhibitory compound, until
viability is reduced to less than 80%.

Choice and cloning of target sequences: 

Sequences of the human immunodeficiency virus type 1
(HIV-1) genome were searched for potential target sequen-
ces. The known sequences of isolates of HIV-1 were
obtained from the GENBANK version 52.0 DNA sequence data
base. First we found non-variable regions of HIV-1. We
examined the HIV-1 genome from the TATA sequence in the
5' LTR of the HIV-1 genome to the end of the sequence coding
for the tat and trs second exons. We intended to locate
non-variable regions where a DBP can interfere with the
production of tat and/or trs mRNA because the products of
these genes are essential in production of virus (DAYT86,
FEIN86).

HIV-1 isolate HXB2 (RATN85) from nucleotide number 1
through 6100 is the reference to which we aligned all other
HIV-1 isolates using the Nucleic Acid Database Search
program (derived from FASTN (LIPM85)) in the IBI/Pustell
Sequence Analysis Programs software package (International
Biotechnologies, Inc., New Haven, CT). All stretches of at
least 20 bases which have no variation in sequence among
all HIV-1 isolates were retained as targets.
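
The retention rule just described (runs of at least 20
aligned positions with no variation) can be sketched as
follows. This is not the IBI/Pustell program; the function
name, the gap convention ("-"), and the toy alignment are
assumptions made for illustration.

```python
def invariant_segments(aligned, min_len=20):
    """Find runs of columns identical in every aligned sequence.

    `aligned` is a list of equal-length, upper-case strings in which
    "-" marks an alignment gap. Returns 1-based inclusive (start, end)
    coordinates of runs at least `min_len` columns long.
    """
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    segments = []
    run_start = None
    for i in range(length + 1):  # one extra step closes a trailing run
        conserved = (
            i < length
            and aligned[0][i] != "-"
            and all(seq[i] == aligned[0][i] for seq in aligned)
        )
        if conserved:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_len:
                segments.append((run_start + 1, i))
            run_start = None
    return segments


# Toy alignment (hypothetical), with a short threshold for demonstration.
toy = ["ACGTACGTAA", "ACGTTCGTAA", "ACGTACGTAA"]
print(invariant_segments(toy, min_len=4))
```

With real data, the first sequence would be the reference
isolate so that the reported coordinates follow its
numbering.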

From the alignment, segments of the HIV-1 isolate HXB2
sequence that are non-variable among all HIV-1 sequences
searched are:



     350 -  371,     519 -  545,     623 -  651,
     679 -  697,     759 -  781,     783 -  805,
    1016 - 1051,    1323 - 1342,    1494 - 1519,
    1591 - 1612,    1725 - 1751,    1816 - 1837,
    2067 - 2094,    2139 - 2164,    2387 - 2427,
    2567 - 2606,    2615 - 2650,    2996 - 3018,
    3092 - 3117,    3500 - 3523,    3866 - 3887,
    4149 - 4170,    4172 - 4206,    4280 - 4302,
    4370 - 4404,    4533 - 4561,    4661 - 4695,
    4742 - 4767,    4808 - 4828,    4838 - 4864,
    4882 - 4911,    4952 - 4983,    5030 - 5074,
    5151 - 5173,    5553 - 5573,    5955 - 5991



In the present Example, these potential regions were
searched for subsequences matching the central seven base
pairs of the λ operators that have high affinity for λ Cro
(viz. OR3, the symmetric consensus, and the Kim et al.
consensus (KIMJ87)). The consensus sequence of Kim et al.
has higher affinity for Cro than does OR3, which is the
natural λ operator having highest affinity for Cro. Cro is
thought to recognize seventeen base pairs, with side
groups on alpha 3 directly contacting the outer four or
five bases on each end of the operator. Because the
composition and sequence of the inner seven base pairs
affect the position and flexibility of the outer five base
pairs to either side, these bases affect the affinity of
Cro for the operator.

The sequences sought are shown in Table 111. The
letters "A" and "S" stand for antisense and sense.
"OR3A/Symm. Consensus.5" is a composite that has OR3A at
all locations except 5, where it has the symmetric consen-
sus base, C. Similarly, "OR3A/Symm. Consensus.6" has the
symmetric consensus base at location 6 and OR3A at other
locations.

A FORTRAN program searched the non-variable HIV-1
subsequence segments for stretches of seven nucleotides of
which at least five are G or C and which are flanked on
either side by five bases of non-variable HIV-1 sub-
sequence. The 427 candidate seven-base-pair subsequences
obtained using these constraints on GC content were then
searched for matches to either the sense or anti-sense
strand sequences of the five seven-base-pair subsequences
listed above. None of the HIV-1 subsequences is identical
to any of the seven-base-pair subsequences. Three HIV-1
subsequences, shown in Table 112, were found that match six
of seven bases. Eight subsequences, shown in Table 113,
were found that match five out of seven bases and that have
five or more GC base pairs. These HIV-1 subsequences are
less preferred than the HIV-1 subsequences that match six
out of seven bases.
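
The search just described can be sketched in a few lines.
This is not the FORTRAN program itself: the function names
and the test segment are hypothetical, and the single probe
used here is the central seven bases of OR3A (CCGCAAG) rather
than the full set of five sequences in Table 111.

```python
_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return "".join(_COMPLEMENT[b] for b in reversed(seq))


def gc_rich_windows(segment, start_pos, width=7, min_gc=5, flank=5):
    """List (position, window) for windows with at least min_gc G/C bases
    that have `flank` bases of the same segment on either side.
    `position` is the coordinate of the window's first base, given that
    the segment's first base has coordinate `start_pos`."""
    hits = []
    for i in range(flank, len(segment) - width - flank + 1):
        window = segment[i:i + width]
        if sum(base in "GC" for base in window) >= min_gc:
            hits.append((start_pos + i, window))
    return hits


def best_match(window, probes):
    """Highest count of matching positions against any probe,
    taken in either orientation (sense or antisense)."""
    def matches(a, b):
        return sum(x == y for x, y in zip(a, b))
    best = 0
    for probe in probes:
        best = max(best, matches(window, probe),
                   matches(window, reverse_complement(probe)))
    return best


probes = ["CCGCAAG"]                    # central seven bases of OR3A
segment = "AAAAACCGCAAGAAAAA"           # hypothetical non-variable segment
for position, window in gc_rich_windows(segment, start_pos=1):
    print(position, window, best_match(window, probes))
```

Windows scoring six of seven would correspond to the more
preferred class of candidates described above.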

        |
         11111111
12345678901234567

5' aCtT TccGCTaaG GaCt   Bases 353-369

actt tccGCTaa aaaat      Left symmetrized

agtccccGCTggggact        Right symmetrized

tatcAcCGCAAgGgata        OR3

(Lower case letters are palindromic in the two halves of
the targets and OR3; highly conserved bases are bold and
marked thus: ^.)

Among the outer five bases of each half operator, bases 1
and 3 are palindromically related to bases 17 and 15 in
Target HIV 353-369.

        |

TCTC GAcGCAaGA CTCG      Bases 681-697

tctcgAcGCAgGcgaga        Left symmetrized

cgaa tAcGCAaG actcg      Right symmetrized

tatc AcCGCAAg Ggata      OR3

 ^^^           ^^^

None of bases 1-5 are palindromically related to bases 13-
17 in Target HIV 681-697.

        |

TTTG AcTAGCGa AGGCT      Bases 760-776

tttg acTAGCGat caaa      Left symmetrized

aacc tcTAGCGa aggct      Right symmetrized

tatc AcCGCAAg Ggata      OR3

None of bases 1-5 are palindromically related to bases 13-
17 in Target HIV 760-776.

25 

There is extensive sequence variability among the
twelve phage λ operator half-sites. For example:

        |

tAtCaCCGCCGGtGaTa        Consensus
tAtCa CCGCaaG gGaTa      OR3A

The bases in lower case in Consensus and OR3 sequences
shown above are more variable among various lambdoid
operators than are bases shown by upper case letters.
Studies of mutant operators indicate that A2 and C4 are
required for Cro binding. In Target HIV 353-369, bases
T3, C6, C7, G8, C9, G14, and A15 match the symmetric
consensus sequence, but the highly conserved A2 and C4 are
different from lambdoid operators and Cro will not bind to
these subsequences. Mutagenesis of the DNA-contacting
residues of alpha 3 is thus the first step in producing a
DBP that recognizes the left symmetrized or right sym-
metrized target sequences.

5 

Target HIV 353-369 is a preferred target because the
core (underlined above) is highly similar to the Kim et al.
consensus. Target HIV 760-776 is preferred over Target HIV
681-697 because it is highly similar to OR3.

10 

The method of the present invention does not require
any similarity between the target subsequence and the
original binding site of the initial DBP. The fortuitous
existence of one or more subsequences within the target
genes that have similarity to the original binding site of
the initial DBP reduces the number of iterative steps
needed to obtain a protein having high affinity and
specificity for binding to a site in the target gene.

Since the target sequence is from a pathogenic
organism, we require that the chosen target subsequence be
absent or rare in the genome of the host organism, e.g. the
target subsequences chosen from HIV should be absent or
rare in the human genome.

25 

Candidate target binding sites are initially screened
for their frequency in primate genomes by searching all DNA
sequences in the GENBANK Primate directory (2,258,436
nucleotides) using the IBI/Pustell Nucleic Acid Database
Search program to locate exact or close matches. A similar
search is made of the E. coli sequences in the GENBANK
Bacterial directory and in the sequence of the plasmid
containing the idbp gene. The sequences of potential sites
for which no matches are found are used to make oligonucle-
otide probes for Southern analysis of human genomic DNA
(SOUT75). Sequences which do not specifically bind human
DNA are retained as target binding sequences.

The HIV 353-369 left symmetrized and right symmetrized
target subsequences are inserted upstream of the selectable
genes in the plasmid pEP1005, replacing the test sequences,
to produce two operative cloning vectors, pEP1011 and
pEP1012, for development of Rav_L and Rav_R DBPs. The
promoter-test sequence cassettes upstream of the tet and
gal operon genes are excised using StuI-HindIII and ApaI-
XbaI restrictions, respectively. Replacement promoter-
target sequence cassettes are synthesized and inserted into
the vector, replacing OR3 with the HIV 353-369 left or
right symmetrized target sequence in the sequences shown in
Table 100 and Table 101.

Choice of residues in Cro to vary:

The choice of the principal and secondary sets of
residues depends on the goal of the mutagenesis. In the
protocol described here we vary, in separate procedures,
the residues: a) involved in DNA recognition by the
protein, and b) involved in dimerization of the protein.
In this section we identify principal and secondary sets of
residues for DNA recognition and dimerization.

Pick principal set for DNA recognition:

The principal set of residues involved in DNA-recogni-
tion is defined as those residues which contact the
operator DNA in the sequence-specific DNA-protein complex.
Although no crystal structure of a λ Cro-operator DNA
complex is available, a crystal structure of a complex
between the structural homolog 434 repressor N-terminal
domain and a consensus operator has been described
(ANDE87). A crystal structure of Cro dimer has been
determined (ANDE81) and modeling studies have suggested
residues that can make sequence-specific or sequence-
independent contacts with DNA in sequence-specific com-
plexes (TAKE83, OHLE83, TAKE85, TAKE86). Isolation and
characterization of Cro mutants have identified residues
which contact DNA in protein-operator complexes (PAKU86,
HOCH86a,b, EISE85).

Important contacts with DNA are made by protein
residues in and around the H-T-H region and in the C-
terminal region. Hochschild et al. (HOCH86a,b) have
presented direct evidence that Cro alpha helix 3 residues
S28, N31, and K32 make sequence-specific contacts with
operator bases in the major groove. Mutagenesis
experiments (EISE85, PAKU86) and modeling studies (TAKE85) have
implicated these residues as well. In addition, these
studies suggest that H-T-H region residues Q16, K21, Y26,
Q27, H35, A36, R38, and K39 also make contacts with
operator DNA. In the C-terminal region, mutagenesis
experiments (PAKU86) and chemical modification studies
(TAKE86) have identified K56 and K62 as making contacts to
DNA. In addition, computer modeling suggests that the 5 to
6 C-terminal amino acids of λ Cro can contact the DNA
along the minor groove (TAKE85). From these
considerations, we select the following set of residues as a
principal set for use in variegation steps intended to
modify DNA recognition by Cro or mutant derivative
proteins: 16, 21, 26, 27, 31, 32, 35, 36, 38, 39, 56, 62, 63,
64, 65, 66.

Pick secondary set for DNA recognition:

The residues in the secondary set contact or otherwise
influence residues in the principal set. A secondary set
for DNA recognition includes the buried residues of alpha
helix 3: A29, I30, A33, and I34. Interactions between
buried residues in alpha helix 2 and buried residues in
alpha helix 3 are known to stabilize H-T-H structure and
residues in the turn between alpha helix 2 and alpha helix
3 of H-T-H proteins are conserved among these proteins
(PTAS86 p102). In λ Cro these positions are T17, T19,
A20, L23, G24, and V25. Changes in the dimerization
region can influence binding. In λ Cro, residues thought



WO 90/07862                                  PCT/US90/00024

to be involved in dimer stabilization are E54, V55, and F58
(TAKE85, PABO84). Finally, residues influencing the
position of the C-terminal arm of λ Cro are P57, P59, and
S60. Thus the secondary set of residues for use in
variegation steps intended to modify DNA recognition by λ Cro or
Rav proteins is: 17, 19, 20, 23, 24, 29, 30, 33, 34, 54,
55, 57, 58, 59 and 60.

Pick principal set for dimerization:

Different principal and secondary sets of residues
must be picked for use in variegation steps intended to
alter dimer interactions. In λ Cro, antiparallel
interactions between E54, V55, and K56 on each monomer have been
proposed to stabilize the dimer (PABO84). In addition, F58
from one monomer has been suggested to contact residues in
the hydrophobic core of the second monomer. Inspection of
the 3D structure of λ Cro suggests important contacts are
made between F58 of one monomer and I40, A33, L23, V25,
E54, and A52. In addition, residues L7, I30, and L42 of
one monomer could make contact with a large side chain
positioned at 58 in the other monomer. Thus, a set of
principal residues includes: 7, 23, 25, 30, 33, 40, 42, 52,
54, 55, 56, and 58.

Pick secondary set for dimerization:

The secondary set of residues for variegation steps
used to alter dimer interactions includes residues in or
near the antiparallel beta sheet that contains the dimer
forming residues. Residues in this region are E53, P57,
and P59. Residues in alpha helix 1 influencing the
orientation of principal set residues are K8, A11, and M12.
Residues in the antiparallel beta sheet formed by the beta
strands 1, 2, and 3 (see Table 1) in each monomer also
influence residues in the principal set. These residues
include I5, T6, K39, F41, V50, and Y51. Thus the set of
secondary residues includes: 5, 6, 8, 11, 12, 41, 50, 51,
53, 57, and 59.
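
The four residue sets chosen above can be recorded as plain sets, which makes it easy to see where the DNA-recognition and dimerization sets overlap. The following is only an illustrative sketch: the variable and function names are our own, and the numbers are copied from the set listings above in λ Cro numbering.

```python
# Residue sets as listed in the text (lambda Cro residue numbers).
# The names below are illustrative and are not part of the protocol.
DNA_PRINCIPAL = {16, 21, 26, 27, 31, 32, 35, 36, 38, 39, 56, 62, 63, 64, 65, 66}
DNA_SECONDARY = {17, 19, 20, 23, 24, 29, 30, 33, 34, 54, 55, 57, 58, 59, 60}
DIMER_PRINCIPAL = {7, 23, 25, 30, 33, 40, 42, 52, 54, 55, 56, 58}
DIMER_SECONDARY = {5, 6, 8, 11, 12, 41, 50, 51, 53, 57, 59}


def shared(a, b):
    """Return the sorted residue numbers present in both sets."""
    return sorted(a & b)


if __name__ == "__main__":
    # Residues that matter for both DNA recognition and dimerization.
    print("principal/principal:", shared(DNA_PRINCIPAL, DIMER_PRINCIPAL))
    print("DNA secondary / dimer principal:",
          shared(DNA_SECONDARY, DIMER_PRINCIPAL))
```

Residues appearing in both a DNA-recognition set and a dimerization set are those where a change made for one purpose might be expected to influence the other.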

Pick the range of variation for alteration of DNA binding:

For the initial variegation step to produce a modified
Rav protein with altered DNA specificity a set of 5
residues from the principal set is picked. Focused
Mutagenesis is used to vary all five residues through all
twenty amino acids. The residues are picked from the
same interaction set so that as many as 3.2 x 10^6 different
DNA binding surfaces will be produced.

A number of studies have shown that the residues in
the N-terminal half of the recognition helix of an H-T-H
protein strongly influence the sequence specificity and
strength of protein binding to DNA (HOCH86a,b, WHAR85,
PABO84). For this reason we choose residues Y26, Q27, S28,
N31, and K32 from the principal set as residues to vary in
the first variegation step. Using the optimized nucleotide
distribution for Focused Mutagenesis described above, and
assuming that S_err = 5% as defined at the start of this
Example, the parental sequence is present in the variegated
mixture at one part in 3.1 x 10^ and the least favored
sequence, F at each residue, is present at one part in 10^.
Thus, this level of variegation is well within bounds for a
synthesis, ligation, transformation, and selection system
capable of examining 5 x 10^ DNA sequences.

Pick the range of variation of residues for alteration of
dimerization:

As described in the Detailed Description and in this
Example, altered λ Cro proteins, RavL and RavR, that bind
specifically and tightly to left and right symmetrized
targets derived from HIV 353-369, are first developed
through one or more variegation steps. Site-specific
changes are then engineered into ravL to produce
dimerization-defective proteins. Structure-directed Mutagenesis is
performed on ravR to produce mutations in RavR that can
complement dimerization-defective RavL proteins and produce
obligate heterodimers that bind to HIV 353-369.

One of the interactions in the dimerization region of
λ Cro is the hydrophobic contact between residues V55 of
both monomers. The VF55 mutation substitutes a bulky
hydrophobic side group in place of the smaller hydrophobic
residue; other substitutions at residue 55 can be made and
tested for their ability to dimerize. A small hydrophobic
or neutral residue present at residue 55 in a protein
encoded on expression by a second gene may result in
obligate complementation of VF55. In addition, changes in
nearby components of the beta strand, E53, E54, K56, and
P57 may effect complementation. Thus a set of residues for
the initial variegation step to alter the RavR dimer
recognition is 53, 54, 55, 56, and 57.

Another interaction in the dimerization region of λ
Cro is the hydrophobic contact between F58 of one monomer
and the hydrophobic core of the other monomer. As
mentioned above, residues L7, L23, V25, A33, I40, L42, A52,
and E54 of one monomer all could make contacts with a
large residue at position 58 in the other monomer. The
FW58 mutation inserts the largest aromatic amino acid at
this position. Compensation for this substitution may
require several changes in the hydrophobic core of the
complementing monomer. Residues for Focused Mutagenesis in
the initial variegation step to alter RavR dimer
recognition in this case are: 23, 25, 33, 40, and 42.

In each of the two cases described above, the initial
variegation step involves Focused Mutagenesis to alter 5
residues through all twenty amino acids. As was shown in
Section 6.2.5, this level of variegation is within the
limits set by using optimized codon distributions and the
values for S_err and transformation yield assumed at the
start of this Example.



Mutagenesis of DNA:

Codons encoding λ Cro residues Y26, Q27, S28, N31, and
K32 are contained in a 51 bp PpuMI to BglII fragment of the
rav gene. To produce the cassette containing the
variegated codons we synthesize the 66 nucleotide antisense
variegated strand, olig#50, and the primer, olig#52:

              d   l   g   v   X   X   X   a   i   X
             22  23  24  25  26  27  28  29  30  31
5' t cct aAG GAC CTA GGG GTG fzk fzk fzk GCG ATT fzk
          | PpuMI |

    X   a   i   h   a   g   r   k   i
   32  33  34  35  36  37  38  39  40
   fzk GCC ATC CAT GCC GGC CGA AAG ATC Tt 3'   olig#50

                    3'-ccg gct ttc tag aacgccgtg-5'   olig#52
                                |BglII |

The position of the amino acid residue in λ Cro is shown
above the codon for the residue. Unaltered residues are
indicated by their lower case single letter amino acid
codes shown above the position number. Variegated residues
are denoted with an upper case, bold X. The restriction
sites for PpuMI and BglII are indicated below the sequence.
Since restriction enzymes do not cut well at the ends of
DNA fragments, 5 extra nucleotides have been added to the
5' end of the cassette. These extra nucleotides are shown
in lower case letters and are removed prior to ligating the
cassette into the operative vector. The sequence "fzk"
denotes the variegated codons and indicates that nucleotide
mixtures optimized for codon positions 1, 2, or 3 are to be
used. "f" is a mixture of 26% T, 18% C, 26% A, and 30% G,
producing four possibilities. "z" is a mixture of 22% T,
16% C, 40% A, and 22% G, producing four possibilities. "k"
is an equimolar mixture of T and G, producing two
possibilities. Each "fzk" codon produces 4 x 4 x 2 = 2^5 = 32
possible DNA sequences, coding on expression for 20
possible amino acids and stop. The DNA segment above
comprises (2^5)^5 = 2^25 = 3.4 x 10^7 different DNA sequences
coding on expression for 20^5 = 3.2 x 10^6 different protein
sequences.

After synthesis and purification of the variegated
DNA, the oligonucleotides #50 and #52 are annealed and the
resulting superoverhang is filled in using Klenow fragment
as described by Hill (AUSU87, Unit 8.2). The double-
stranded oligonucleotide is digested with the enzymes PpuMI
and BglII and the mutagenic cassette is purified as
described by Hill. The mutagenic cassette is cloned into
the vectors pEP1011 and pEP1012, which have been digested
with PpuMI and BamHI, and the ligation mixtures containing
variegated DNA are used to transform competent delta4
cells. The transformed cells are selected for vector
uptake and for successful repression at low stringency as
described above. Cells containing Rav proteins that bind
to the left or right symmetrized targets display the TcS,
FusR and GalR phenotypes.

Surviving colonies are screened for correct DBP+ and
DBP- phenotypes in the presence or absence of IPTG as
described above. Relative measures of the strengths of
DBP-DNA interactions in vivo are obtained by comparing
phenotypes exhibited at reduced levels of IPTG. DBP genes
from clones exhibiting the desirable phenotypes are
sequenced. Plasmid numbers from pEP1100 to pEP1199 are
reserved for plasmids yielding ravL genes encoding proteins
that bind to the Left Symmetrized Targets carried on the
plasmids. Similarly, plasmid numbers pEP1200 through
pEP1299 are reserved for plasmids containing ravR genes
encoding proteins that bind to the Right Symmetrized
Targets carried on these plasmids.

Based on the determinations above, one or more RavL
and RavR proteins are chosen for further analysis in vitro.
Proteins are purified as described above. Purified DBPs
are quantitated and characterized by absorption
spectroscopy and polyacrylamide gel electrophoresis.

In vitro measurements of protein-DNA binding using
purified DBPs are performed as described in the Overview:
DNA-Binding, Protein Purification, and Characterization and
in this Example. These measurements determine equilibrium
binding constants (K_D), and the dissociation (k_d) and
association (k_a) rate constants for sequence-specific and
sequence-independent DBP-DNA complexes. In addition, DNase
protection assays are used to demonstrate specific DBP
binding to the Target sequences.

Estimates of relative DBP stability are obtained from
measurements of the thermal denaturation properties of the
proteins. In vitro measures of protein thermal stability
are obtained from determinations of protein circular
dichroism and resistance to proteolysis by thermolysin at
various temperatures (HECH84) or by differential scanning
calorimetry (HECH85b).

One or more iterations of variegation, involving
residues thought capable of influencing DNA binding, of
the ravL and ravR genes produce RavL and RavR proteins
that bind tightly and specifically to the HIV 353-369 left
and right symmetrized targets. Additional variegation
steps to optimize protein binding properties can be
performed as outlined in the Overview: Variegation
Strategy.

By hypothesis, we isolate pEP1127 that contains a
pdbp gene that codes on expression for RavL-27, shown in
Table 114, that binds the left-symmetrized target best
among selected RavL proteins. Similarly, pEP1238 contains
a pdbp gene that codes on expression for RavR-38, shown in
Table 115, that binds the right-symmetrized target best
among selected RavR proteins.

We now use the genes for the RavL and RavR monomers as
starting points for production of obligately heterodimeric
proteins RavL:RavR that recognize the HIV 353-369 target.
First we change the target sequences in pEP1238 (containing
ravR-38). We replace both occurrences of the Right
Symmetrized Target (in tet and galT,K promoters) with the
HIV 353-369 target sequence. Delta4 cells containing
plasmids carrying the HIV 353-369 targets display the ApR,
TcR, FusS and GalS phenotypes. Plasmids carrying HIV 353-
369 targets and the ravR gene are designated by numbers
pEP1400 through pEP1499, corresponding to the number of
the donor plasmid of the 1200 series; for example,
replacing the target sequences in pEP1238 produces pEP1438.



Engineering dimerization mutants of RavL

To create the site-specific VF55 and FW58 mutations in
ravL we synthesize the two mutagenesis primers:

     a   e   e   F   k   p   f
    52  53  54  55  56  57  58
5'  GGC GAA GAG TTC AAG CCC TTC 3'    VF55 primer
                ___

     v   k   p   W   p   s   n
    55  56  57  58  59  60  61
5'  GTA AAG CCC TGG CCC AGT AAC 3'    FW58 primer
                ___

Underlining indicates the varied codons and residues. The
plasmid pEP1127 (containing ravL-27) is chosen for
mutagenesis. The gene fragment coding on expression for the
carboxy-terminal region of the RavL protein is transferred
into M13mp18 as a BamHI to KpnI fragment. Oligonucleotide-
directed mutagenesis is performed as described by Kunkel
(AUSU87, Unit 8.1). The fragment bearing the modified
region of RavL is removed from M13 RF DNA as the BamHI to
KpnI fragment and ligated into the correct location in the
pEP1100 vector. Mutant-bearing plasmids are used to
transform competent cells. Transformed cells are selected
for plasmid uptake and screened for DBP- phenotypes (TcR,
FusS, and GalS in E. coli delta4; Gal+ in E. coli HB101).
Plasmids isolated from DBP- cells are screened by
restriction analysis for the presence of the ravL gene and the
site-specific mutation is confirmed by sequencing. The
plasmid containing the ravL-27 gene with the VF55 mutation
is designated pEP1301. Plasmid pEP1302 contains the ravL-
27 gene with the FW58 alteration.

For the production of obligate heterodimers as
described below, the ravL- genes encoding the VF55 or FW58
mutations are excised from pEP1301 or pEP1302 and are
transferred into plasmids containing the gene for Km and
neomycin resistance (neo, also known as nptII). These
constructions are performed in three steps as outlined
below. First, the neo gene from Tn5 coding for KmR and
contained on a 1.3 kbp HindIII to SmaI DNA fragment is
ligated into the plasmid pSP64 (Promega, Madison, WI)
which has been digested with both HindIII and SmaI. The
resulting 4.3 kbp plasmid, pEP1303, confers both Ap and
Km resistance on host cells. Next, the bla gene is removed
from pEP1303 by digesting the plasmid with AatII and BglI.
The 3.5 kbp fragment resulting from this digest is
purified, the 3' overhanging ends are blunted using T4 DNA
polymerase (AUSU87, Unit 3.5), and the fragment is
recircularized. This plasmid is designated pEP1304 and
transforms cells to Km resistance. In the final step, the ravL-
gene is incorporated into pEP1304. Plasmid pEP1301 or
pEP1302 is digested with SfiI and the resulting 3'
overhangs are blunted using T4 DNA polymerase. Next the
linearized plasmid is digested with SpeI and the resulting
5' overhangs are blunted using the Klenow enzyme reaction
(KLEN70). The ca. 340 bp blunt-ended DNA fragment
containing the entire ravL- gene is purified and ligated into the
PvuII site in pEP1304. Transformed cells are selected for
KmR and screened by restriction digest analysis for the
presence of ravL- genes. The presence of ravL- genes
containing the site-specific VF55 or FW58 mutations is
confirmed by sequencing. The plasmid containing the ravL-
gene with the VF55 mutation is designated pEP1305. The
plasmid containing the ravL- gene with the FW58 mutation
is designated pEP1306.

In a manner similar to the constructions described
above, we ligate the original unmodified ravL gene into
pEP1304 to produce plasmid pEP1307.

Engineering heterodimer binding of target DNA:

This round of variegation is performed to produce
mutations in RavR proteins that complement the
dimerization-deficient mutations in the RavL proteins produced
above. To complement the FW58 mutation, the set of five
residues L23, V25, A33, I40, and L42 are chosen from the
primary set of residues as targets for Focused Mutagenesis.

In an initial series of procedures to test for
recognition of HIV 353-369 by the heterodimer RavL:RavR, we
transform cells containing pEP1438 (containing ravR-38 and
HIV 353-369 targets) with pEP1307 (containing ravL).
Intracellular expression of ravL and ravR produces a
population of dimeric repressors: RavL:RavL, RavR:RavR and
RavL:RavR. If the heterodimeric protein is formed and
binds to HIV 353-369, cells expressing both rav alleles
will exhibit the KmR ApR GalR FusR phenotypes (vide
infra). Several pairs of ravL and ravR genes are used in
parallel procedures; the best pair is picked for use and
further study. Selections for binding the HIV 353-369
target by the heterodimeric protein can be optimized using
this system.

Focused Mutagenesis of residues 23, 25, 33, 40, and 42
requires the synthesis and annealing of two overlapping
variegated strands because in the rav gene a single
cassette spanning these residues extends from the BalI site
to the BamHI site and exceeds the assumed synthesis limit
of 100 nucleotides. As no variegation affects the overlap,
the annealing region is complementary. The antisense
strand of the DNA sequence from the BalI site blunt end to
the end of the codon for G37 is denoted olig#53.

      q   t   k   t   a   k   d   X   g   X   y   q
     16  17  18  19  20  21  22  23  24  25  26  27
5' C CAA ACC AAG ACA GCG AAG GAC fzk GGG fzk TAT CAG
   |BalI|

      s   a   i   n   k   X   i   h   a   g
     28  29  30  31  32  33  34  35  36  37
     AGC GCG ATT AAC AAG fzk ATC CAT GCC GGC 3'   olig#53

f = (26% T, 18% C, 26% A, 30% G)
z = (22% T, 16% C, 40% A, 22% G)
k = equimolar T and G

Olig#53 contains vg codons for residues 23, 25, and 33.

Olig#54 is the sense strand from base 1 in codon 34
to the BamHI site:

      i   h   a   g   r   k   X
     34  35  36  37  38  39  40
3'   TAG GTA CGG CCG GCA TTC jqm

      f   X   t   i   n   a   d   g   s
     41  42  43  44  45  46  47  48  49
     AAG jqm TGG TAA TTG CGA CTA CCT AGG cca ca 5'   olig#54
                                 |BamHI|

j = (26% A, 18% G, 26% T, 30% C)
q = (22% A, 16% G, 40% T, 22% C)
m = equimolar A and C

Olig#54 contains variegated codons for residues 40 and 42.
Since olig#54 is the sense strand, the variegated
nucleotide distributions must complement the distributions for
codon positions 1, 2, and 3 used in the antisense strand.
These sense codon distributions are designated "j", "q",
and "m", and represent the complements to the optimized
codon distributions developed for codon positions 1, 2, and
3, respectively, in the antisense strand. The two strands
(olig#53 and olig#54) share a 12 nucleotide overlap
extending from the first position in the codon for I34 to the end
of the codon for G37. The overlap region is 66% G or C.
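
The relationship between the f, z, k mixtures and their complements j, q, m, and the G+C content of the overlap, can be verified with a few lines of code. This is a sketch under the assumption that the overlap is the 12 nucleotides ATC CAT GCC GGC (codons for I34 through G37 as written in olig#53); the function names are illustrative.

```python
# Watson-Crick pairing used to derive the complementary mixtures.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Mixtures for the strand carrying the "fzk" codons, as given in the text.
F = {"T": 0.26, "C": 0.18, "A": 0.26, "G": 0.30}
Z = {"T": 0.22, "C": 0.16, "A": 0.40, "G": 0.22}
K = {"T": 0.50, "G": 0.50}


def complement_mixture(mixture):
    """Map each base in a nucleotide mixture to its complement, keeping fractions."""
    return {COMPLEMENT[base]: frac for base, frac in mixture.items()}


def gc_fraction(seq):
    """Fraction of G or C bases in a sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


J = complement_mixture(F)
Q = complement_mixture(Z)
M = complement_mixture(K)

# Codons for I34, H35, A36, G37 as written in olig#53 (assumed overlap).
OVERLAP = "ATCCATGCCGGC"

if __name__ == "__main__":
    print("j =", J)
    print("q =", Q)
    print("m =", M)
    print(f"overlap length {len(OVERLAP)}, G+C = {gc_fraction(OVERLAP):.1%}")
```

Eight of the twelve overlap positions are G or C, in agreement with the 66% figure stated above.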

The two strands shown above are synthesized, purified,
annealed, and extended to form dsDNA. Following restriction
endonuclease digestion and purification, the mutagenic
cassettes are ligated into pEP1438 (containing the
asymmetric HIV 353-369 target) in the appropriate locus in the
ravR gene. The ligation mixtures are used to transform
competent cells that contain pEP1306 (the plasmid with the
ravL- gene carrying the FW58 site-specific mutation).

Above we picked a set of five residues in λ Cro, E53,
E54, V55, K56, and P57, as targets for focused mutagenesis
in the first variegation step of the procedure to produce a
RavR protein that complements the dimerization-deficient
VF55 RavL mutation. These five residues are contained on a
71 bp BamHI to KpnI fragment of the rav gene (Table 100).
To produce a cassette containing the variegated codons we
synthesize olig#58:

           g   s   v   y   a   X   X   X   X   X   f
          48  49  50  51  52  53  54  55  56  57  58
5' ct gat GGA TCC GTC TAC GCG fzk fzk fzk fzk fzk TTC
          |BamHI|

           p   s   n   k   k
          59  60  61  62  63
          CCG AGT AAC AAA AAA

           t   t   a   .
          64  65  66  67
          ACA ACA GCG TAA TAGTAGGTACC ta 3'   olig#58

After synthesis and purification of the vgDNA,
strands are self-annealed using the 10 nucleotide
palindrome at the 3' end of the sequence. The resulting
superoverhangs are filled in using the Klenow enzyme
reaction as described previously and the double-stranded
oligonucleotide is digested with BamHI and KpnI. Purified
mutagenic cassettes are ligated into one or more operative
vectors (picked from the pEP1200 series) in the appropriate
locus in the ravR gene. The ligation mixtures are used to
transform competent cells that contain pEP1305 (the plasmid
carrying the ravL- gene with the VF55 mutation).
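
Self-annealing of olig#58 depends on its 3' end being its own reverse complement. The sketch below checks that property for the last ten nucleotides as read from the sequence above (TAGGTACCTA, which contains the KpnI site GGTACC); the function names are illustrative and the ten-nucleotide reading is our assumption from the printed sequence.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq.upper()))


def is_self_complementary(seq):
    """True if the sequence equals its own reverse complement."""
    return reverse_complement(seq) == seq.upper()


# Last 10 nucleotides of olig#58 as read from the sequence shown above.
TAIL = "TAGGTACCta"

if __name__ == "__main__":
    print(TAIL.upper(), "->", reverse_complement(TAIL))
    print("self-complementary:", is_self_complementary(TAIL))
```

Two copies of the strand can therefore pair through this tail, leaving long single-stranded overhangs that the Klenow reaction fills in.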

Operative vectors carrying the VF55 or FW58 mutation
in ravL confer Km resistance. Operative vectors carrying
mutagenized ravR genes contain the gene for ApR as well as
the selective gene systems for the DBP+ phenotypes. Cells
containing complementing mutant proteins are selected by
requiring both ApR and KmR and repression of the complete
HIV 353-369 target sequence (substituted for the Left and
Right Symmetrized Targets in the selection genes). Cells
possessing the desired phenotype are ApR, KmR, FusR, and
GalR (in E. coli delta4).




Plasmids from candidate colonies are first isolated
genetically by transformation of cells at low plasmid
concentration. Cells carrying plasmids coding for RavL
proteins will be KmR, while cells carrying plasmids coding
for RavR proteins will be ApR. Plasmids are individually
screened to ensure that they confer the DBP- phenotype and
are characterized by restriction digest analysis to confirm
the presence of ravL- or ravR genes. Plasmid pairs are
co-tested for complementation by restoration of the DBP+
phenotype when both ravR and ravL- are present
intracellularly. Successfully complementing plasmids are sequenced
through the rav genes to identify the mutations and to
suggest potential locations for optional subsequent rounds
of variegation.

Plasmids containing genes for altered RavR proteins
that successfully complement the ravL VF55 mutation are
designated by plasmid numbers pEP1500 to pEP1599.
Similarly, plasmids containing genes for altered RavR proteins
that successfully complement the ravL FW58 mutation are
designated by plasmid numbers pEP1600 to pEP1699.

Heterodimeric proteins are purified and their DNA-
binding and thermal stability properties are characterized
as described above. Pairwise variation of the RavR and
RavL monomers can produce dimeric proteins having different
dimerization or dimer-DNA interaction energies. In
addition, further rounds of variegation of either or both
monomers to optimize DNA binding by the heterodimer,
dimerization strength, or both may be performed.

In this manner a heterodimeric protein that recognizes
any predetermined target DNA sequence is constructed. The
foregoing is hypothetical. The sequences shown as the
result of selection are given by way of example and must
not be construed as predictions that proteins of the stated
sequence will have specific affinity for any DNA sequence.

Example 2

Presented below is a hypothetical example of a
protocol for developing new DNA-binding polypeptides,
derived from the first ten residues of phage P22 Arc and a
segment of variegated polypeptide, with affinity for DNA
subsequences found in HIV-1 using E. coli K12 as the host
cell line. Some further optimization, in accordance with
the teachings herein, may be necessary to obtain the
desired results. Possible modifications in the preferred
method are discussed immediately following the hypothetical
example.

We set the same hypothetical technical capabilities as
used in Detailed Example 1.

To obtain significant binding between a genetically
encoded polypeptide and a predetermined DNA subsequence,
the surfaces must be complementary over a large area, 1000
Å² to 3000 Å². For the binding to be sequence-specific,
the contacts must be spread over many (12 to 20) bases. An
extended polypeptide chain that touches 15 base pairs
comprises at least 25 amino acids. Some of these residues
will have their side groups directed away from the DNA so
that many different amino acids will be allowed at such
residues, while other residues will be involved in direct
DNA contacts and will be strongly constrained. Unless we
have 3D structural data on the binding of an initial
polypeptide to a test DNA subsequence, we cannot a priori
predict which residues will have their side groups directed
toward the DNA and which will have their side groups
directed outward. We also cannot predict which amino
acids should be used to specifically bind particular base
pairs. Current technology allows production of lo"^ to 10^ 
independent transformants per µg of DNA, which allows
variation of 5 or 6 residues through all twenty amino
acids. Alternatively, between 23 and 30 two-way variations
of DNA bases can be applied that will affect between 8 and
30 codons.
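
The two variegation strategies mentioned above lead to very different library sizes, which a short calculation makes concrete. The sketch below is illustrative only; the function names are our own, and it counts distinct sequences without making any assumption about the number of transformants actually obtained.

```python
def full_variation(n_residues, n_amino_acids=20):
    """Protein sequences when each of n residues is varied through all amino acids."""
    return n_amino_acids ** n_residues


def binary_variation(n_sites):
    """Sequences when each of n sites has two alternatives."""
    return 2 ** n_sites


if __name__ == "__main__":
    for n in (5, 6):
        print(f"{n} residues x 20 amino acids: {full_variation(n):.2e}")
    for n in (23, 24, 30):
        print(f"{n} two-way variations: {binary_variation(n):.2e}")
```

Varying 5 or 6 residues fully gives between 3.2 x 10^6 and 6.4 x 10^7 protein sequences, which is comparable to the range spanned by 23 to 30 two-way variations (about 8.4 x 10^6 to 1.1 x 10^9).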

Sauer and colleagues (VERS87b) have shown that P22
Arc binds to DNA using a motif other than H-T-H. There is
as yet no published X-ray structure of Arc, though the
protein has been crystallized and diffraction data have
been collected (JORD85). A combination of genetics and
biochemistry indicates that the first 10 residues of each
Arc monomer (M-K-G-M-S-K-M-P-Q-F) bind to palindromically
related sets of bases on either side of the center of
symmetry of the 21 bp operator shown in Table 200.
Furthermore, the first ten residues of each Arc monomer
assume an extended conformation (VERS87b). The hydrophobic
residues may be involved in contacts to the rest of the
protein, but there are several examples from H-T-H DBPs of
hydrophobic side groups being in direct contact with bases
in the major groove. We do know that these first ten
residues of Arc can exist in a conformation that makes
sequence-specific favorable contacts with the arc operator.

We pick a target DNA subsequence from the HIV-1 genome
such that a portion of the chosen sequence is similar to
one half-site of the arc operator. We use part of this
chosen sequence for an initial chimeric target. One half
of the first target is the DNA subsequence obtained from
HIV-1 and the other half of the target is one half-site of
the arc operator. For this example, we will use a plasmid
bearing wild-type arc operators repressed by the Arc
repressor as a control. After demonstrating that Arc
repressor can regulate the selectable genes, we replace the
wild-type arc operator with the target DNA subsequence. We
then replace the arc gene with a variegated pdbp gene and
select for cells expressing DBPs that can repress the
selectable genes.

Once a protein is obtained that binds to the target
that has similarity to one half of the Arc operator, we can
change the target so that it has less similarity to one
half of the Arc operator and mutagenize those residues that
correspond to residues 1-10 of Arc. In vivo selection will
isolate a protein that binds to the new target. A few
repetitions of this process can produce a polypeptide that
binds to any predetermined DNA sequence.

Our potential DNA-binding polypeptide (DBP) will be 36
residues long and will contain the first ten residues of
Arc which are thought to bind to part of the half operator.
DNA encoding the first ten amino acids of Arc is linked at
the 3' terminus of this gene fragment to vgDNA that encodes
a further 26 amino acids. Twenty-four of the codons encode
two alternative amino acids so that 2^24 = approx. 1.7 x 10^7
protein sequences result. The amino acids encoded are
chosen to enhance the probability that the resulting
polypeptide will adopt an extended structure and that it can
make appropriate contacts with DNA. The Chou-Fasman
(CHOU78a, CHOU78b) probabilities are used to pick amino
acids with high probability of forming beta structures (M,
V, I, C, F, Y, Q, W, R, T); the amino acids are grouped
into five classes in Table 16. In addition, to discourage
sequence-independent DNA binding, some acidic residues
should be included. Glutamic acid is a strong alpha helix
former, so in early stages we use D exclusively. Further,
S and T both can make hydrogen bonds with their hydroxyl
groups, but T favors extended structures while S favors
helices; hence we use only T in the initial phase.
Likewise, N and Q provide similar functionalities on their
side groups, but Q favors beta and so is used exclusively
in initial phases. Positive charge is provided by K and R,
but only R is used in the variegated portion. Alanine
favors helices and is excluded. P kinks the chain and is
allowed only near the carboxy terminus in initial
iterations.

After one selection, we design a different set of
binary variegations that includes the selected sequence
and perform a second mutagenesis and selection. After two
or more rounds of diffuse variegation and selection, we
choose a subset of residues and vary them through a larger
set of amino acids. We continue until we obtain sufficient
affinity and specificity for the target. None of the
polypeptides discussed in this example is likely to have a
defined 3D structure of its own, because they are all too
short. Even if one folded into a definite structure, that
structure is unlikely to be related to DNA-binding. A 3D
structure, obtained by X-ray diffraction or NMR, of a DNA-
polypeptide complex would give us useful indications of
which residues to vary. Scattering the variegation along
the chain and sampling different charges, sizes, and
hydrophobicities produces a series of proteins, isolated by
in vivo selection, with progressively higher affinity for
the target DNA sequence.

20 

Selection systems are the same as used in Example 1,
viz. fusaric acid to select against cells expressing the
tet gene and galactose killing by galT,K in a galE-deleted
host. First, in three genetic engineering steps, we
replace: a) the rav gene in pEP1009 with the arc gene, and
b) the target DNA sequences (both occurrences) with the arc
operator. The resulting plasmid is our wild type control.

To replace rav with arc, the synthetic arc gene,
shown in Table 201 and Table 202, is synthesized and
ligated into pEP1009 that has been digested with BstE II
and Kpn I. Cells are transformed and colonies are screened
for Tc^R. The plasmid is named pEP2000. Delta4 cells
transformed with pEP2000 are Tc^R and Gal^S because pEP2000
lacks the rav gene.

To insert the arc operator into the neo promoter
(Pneo) of the tet gene in pEP2000, we digest pEP2000 with
StuI and HindIII and ligate the purified backbone to
annealed synthetic olig#430 and olig#432.

Arc operator and Pneo that promotes tet

           5' |cct|gcg|aac|cgg|aat|tgc|cag|-
Olig#430 = 3'  gga cgc ttg gcc tta acg gtc-
              | StuI |           |  -35 |

              |CTG|GGG|CGC|CCT|CTG|GTA|AGG|TTG|-
               gac ccc gcg gga gac cat tcc aac-
                                    | -10  |

              |GGA|ATG|ATA|GAA|GCA|CTC|TAC|TAT|A 3' = Olig#432
               cct tac tat ctt cgt gag atg ata t tcg a 5'
                  |------ Arc operator -------| |Hind3|

The plasmid is named pEP2001 and confers Fus^R, Gal^S, Ap^R on
delta4 cells.

To insert the arc operator into the amp promoter for
the galT,K genes in pEP2001, we digest pEP2001 with Apa I
and Xba I and ligate the purified backbone to synthetic
olig#416 and olig#417 that have been annealed in the
standard way.

Arc operator and Pamp that promotes galT,K

           5'       |ctt|cta|aat|aca|ttc|aaa|-
Olig#417 = 3' c cgg  gaa gat tta tgt aag ttt-
             | ApaI |               |  -35  |

                    |TAT|GTA|TCC|GCT|CAT|GAG|ACA|ATA|ACC|-
                     ata cat agg cga gta ctc tgt tat tgg-
                                           | -10  |

                    |CTT|ATG|ATA|GAA|GCA|CTC|TAC|TAT|CGT 3' = Olig#416
                     gaa tac tat ctt cgt gag atg ata gca gat c 5'
                        |------ Arc Operator -------|   | XbaI |
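
As a minimal sketch (not part of the original procedure), the
following Python check confirms that the Arc-operator portions of the
two annealed strands shown above are exactly complementary. The
function names are hypothetical; the two sequences are transcribed
from the olig#416/olig#417 diagram.

```python
# Translation table that maps each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def anneals(top_5to3: str, bottom_3to5: str) -> bool:
    """True if every base of the top strand pairs with the base below it.

    The bottom strand is given 3'->5', i.e. as it is drawn underneath
    the top strand in the oligonucleotide diagrams.
    """
    if len(top_5to3) != len(bottom_3to5):
        return False
    return top_5to3.upper().translate(COMPLEMENT) == bottom_3to5.upper()


# Arc-operator segment: top strand (5'->3') and bottom strand (3'->5').
arc_top = "ATGATAGAAGCACTCTACTAT"
arc_bottom = "TACTATCTTCGTGAGATGATA"

print(anneals(arc_top, arc_bottom))
print(reverse_complement(arc_top))
```

The same check can be applied to any other pair of strands in the
diagrams before the oligonucleotides are ordered.
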

The plasmid is named pEP2002 and confers Gal^R, Fus^R, Ap^R on
delta4 cells. This plasmid is our wild type for work with
polypeptides that are selected for binding to target DNA
subsequences that are related to the arc operator.

Development of polypeptides that bind chimeric target DNA:

We now replace:

a) the two occurrences of the arc operator with the
first target sequence that is a hybrid of the arc
operator and a subsequence picked from HIV-1, and

b) the arc gene by a variegated pdbp gene.

A hybrid non-palindromic target sequence is used in
this example because selection of a polypeptide using a
palindromic or nearly palindromic target DNA subsequence is
likely to isolate a novel dimeric DBP. The goal of this
procedure is to isolate a polypeptide that binds DNA but
that does not directly exploit the dyad symmetry of DNA.
The binding is most likely in the major groove, but the
present invention is not limited to polypeptides that bind
in the major groove. The selections are performed using a
non-symmetric target to avoid isolation of novel dimers
that support two symmetrically related copies of the
original recognition elements.

The non-variable regions of the HIV-1 genome, as
listed in Example 1, were searched using a half operator
from the arc operator as search sequence.
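
A search of this kind can be sketched in Python as follows. This is
an illustration under stated assumptions, not the procedure actually
used: the function names are hypothetical, ambiguous operator
positions are written with standard IUPAC codes, and the short
"genome" is toy data, not HIV-1 sequence.

```python
# Standard IUPAC nucleotide codes: each code lists the bases it allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_mismatches(pattern: str, window: str) -> int:
    """Count positions where the window base is not allowed by the pattern."""
    return sum(base not in IUPAC[code] for code, base in zip(pattern, window))


def find_sites(genome: str, pattern: str, max_mismatches: int = 1,
               first_base_number: int = 1):
    """Return (start position, mismatches) for every acceptable window."""
    genome, pattern = genome.upper(), pattern.upper()
    hits = []
    for i in range(len(genome) - len(pattern) + 1):
        mismatches = count_mismatches(pattern, genome[i:i + len(pattern)])
        if mismatches <= max_mismatches:
            hits.append((i + first_base_number, mismatches))
    return hits


# Toy data (hypothetical): one exact copy and one single-mismatch copy
# of a ten-base search sequence embedded in a short made-up genome.
toy_genome = "GGCATGATAGAAGTTTATGTTAGAAGCC"
toy_half_site = "ATGATAGAAG"

print(find_sites(toy_genome, toy_half_site))
```

Sorting the hits by distance from the start of transcription would
then reflect the preference for promoter-proximal targets.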

We sought subsequences in the non-variable sequences
of the HIV-1 genome that match either half of the consensus
P22 arc operator shown in Table 200. Subsequences that are
closer to the start of transcription are preferred as
targets because proteins binding to these subsequences will
have greater effect on the transcription of the genes. No
sequence was found that matched all six unambiguous bases;
the subsequences at 1024, 1040, and 2387 (shown in Table
203) each have a single mismatch. Lower case letters in
the arcO sequence indicate ambiguity in the P22 arc
operator sequence. Lower case, bold, underscored letters
in the HIV-1 subsequences indicate mismatch with the
consensus arc operator. Two other subsequences, shown in
Table 203, have one mismatch at one of the conserved bases
and one mismatch with one of the ambiguous bases. The
HIV-1 subsequence that starts at base 1024 is chosen as a
target sequence. We replace the 3' ten bases of the arc
operator with the 3' ten bases of this subsequence to
produce the hybrid target sequence:

ATGATAGAAG | C | GCAACCCTCT.
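
The construction of the hybrid can be written out as a short Python
sketch. The function name is hypothetical; the two 21-base sequences
are transcribed from the oligonucleotide diagrams of this example
(the Arc operator from the olig#416 strand and HIV 1024-1044 from the
line printed with olig#442).

```python
ARC_OPERATOR = "ATGATAGAAGCACTCTACTAT"   # 21 bp Arc operator
HIV_1024_1044 = "ATAATACAGTAGCAACCCTCT"  # 21 bp HIV-1 subsequence


def make_hybrid(operator: str, target: str, n_three_prime: int = 10) -> str:
    """Replace the 3' n bases of the operator with the 3' n bases of target."""
    if len(operator) != len(target):
        raise ValueError("operator and target must have the same length")
    return operator[:-n_three_prime] + target[-n_three_prime:]


hybrid = make_hybrid(ARC_OPERATOR, HIV_1024_1044)
print(hybrid)
```

The left half and central base come from the Arc operator, and the
right half comes from HIV-1, which is what makes the target
non-palindromic.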

We insert this sequence into the promoter that regulates
tet in pEP2002 by ligating dsDNA composed of an equimolar
mixture of olig#440 and olig#442 into the StuI/HindIII site
of pEP2002. Substitution of the arc operator by the
arc-HIV-1 hybrid sequence relieves the repression by Arc. The
construction is called pEP2003 and confers Tc^R, Ap^R, Gal^R
on delta4 cells.

First Target and Pneo that promotes tet

           5' |cct|gcg|aac|cgg|aat|tgc|cag|-
Olig#440 = 3'  gga cgc ttg gcc tta acg gtc-
              | StuI |           |  -35 |

              |CTG|GGG|CGC|CCT|CTG|GTA|AGG|TTG|-
               gac ccc gcg gga gac cat tcc aac-
                                    | -10  |

                   ATA ATA CAG TAg caa ccc tct   = HIV 1024-1044
              |GGA|ATG|ATA|GAA|GCg|caa|ccc|tct|A 3' = Olig#442
               cct tac tat ctt cgC GTT GGG AGA t tcg a 5'
                  |------ First Target -------| |Hind3|

The second instance of the target is engineered in
like manner, using pEP2003 first digested with Apa I and
Xba I and then ligated to annealed olig#444 and olig#446.
The plasmid is called pEP2004 and confers Gal+, Tc^R, Ap^R on
HB101 cells. The plasmid pEP2004 contains the first target
sequence in both selectable genes and is ready for
introduction of a variegated pdbp gene.



First Target and Pamp that promotes galT,K

           5'       |ctt|cta|aat|aca|ttc|aaa|
Olig#444 = 3' c cgg  gaa gat tta tgt|aag ttt|

                    |tat|gta|tcc|gct|cat|gag|aca|ata|acc|ct-
                     ata cat agg cga gta ctc tgt tat tgg ga
                                           | -10  |

                    T|ATG|ATA|GAA|GCg|caa|ccc|tct|CGT 3' = Olig#446
                    a tac tat ctt cgC GTT GGG AGA gca gat c 5'
                     |------ First Target -------|   | XbaI |

The variegated DNA for a 36 amino acid polypeptide is
shown in Table 204. This DNA encodes the first ten amino
acids of P22 Arc followed by 26 amino acids chosen to be
likely to form extended structures. In Table 204, we
indicate variegation at one base by using a letter, other
than A, C, G, or T, to represent a specific mixture of
deoxynucleotide substrates. The range of amino acids
encoded is written above the codon number:

      I/M
       11
     |ATS|

indicates that the first base is synthesized with A, the
second base with T, and the third base with a mixture of C
and G, and that the resulting DNA could encode amino acids
I or M. That the parental protein has isoleucine at
residue 11 is indicated by writing I first. Residues 22
and 23 are not variegated to provide a homologous overlap
region so that olig#420 and olig#421 can be annealed.
After olig#420 and olig#421 are annealed and extended with
Klenow fragment and all four deoxynucleotide triphosphates,
the DNA is digested with both BstE II and Bsu36 I and ligated
into pEP2004 that has also been digested with BstE II and
Bsu36 I. The ligated DNA, denoted vg1-pEP2004, is used to
transform Delta4 cells. After an appropriate grow out in
the presence of IPTG, the cells are selected with fusaric
acid and galactose.
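
The codon notation described above can be checked with a short Python
sketch that expands a variegated codon into the codons and amino
acids it can encode. The function names are hypothetical; the table is
the standard genetic code, and the mixture letters are assumed to
follow the standard IUPAC convention (S = C or G), which agrees with
the description of ATS given above.

```python
from itertools import product

# Standard genetic code, indexed in TCAG order for each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# IUPAC codes for mixtures of nucleotide substrates.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_codon(variegated: str) -> dict:
    """Map every concrete codon of a variegated codon to its amino acid."""
    choices = [IUPAC[letter] for letter in variegated.upper()]
    return {"".join(c): CODON_TABLE["".join(c)] for c in product(*choices)}


def encoded_amino_acids(variegated: str) -> list:
    """Sorted list of distinct amino acids encoded by a variegated codon."""
    return sorted(set(expand_codon(variegated).values()))


print(expand_codon("ATS"))
print(encoded_amino_acids("ATS"))
```

Running the same expansion over every codon of a variegated
oligonucleotide is a quick way to confirm that the parental amino acid
remains available at each varied residue.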

By hypothesis, we recover ten colonies that are Gal^R
and Fus^R. We sequence the plasmid DNA from each of these
colonies. A hypothetical DBP amino acid sequence from one
of these colonies is shown in Table 205.

Comparison of the amino-acid sequences of different
isolates may provide useful information on which residues
play crucial roles in DNA binding. Should a residue
contain the same amino acid in most or all isolates, we
might infer that the selected amino acid is preferred for
binding to the target sequence. Because we do not know
that all of the isolates bind in the same manner, this
inference must be considered as tentative. Residues closer
to the unvaried section that have repetitive isolates
containing the same amino acid are more informative than
residues farther away.
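
One way to carry out such a comparison is sketched below in Python.
The function name, the 75% threshold, and the five short isolate
sequences are all hypothetical and serve only to illustrate the
tabulation; they are not data from any selection.

```python
from collections import Counter


def conserved_positions(isolates, first_residue=1, min_fraction=0.75):
    """Return {residue number: amino acid} for well-conserved positions.

    A position is reported when its most common amino acid occurs in at
    least min_fraction of the aligned, equal-length isolate sequences.
    """
    if len({len(seq) for seq in isolates}) != 1:
        raise ValueError("isolate sequences must have equal length")
    conserved = {}
    for offset, column in enumerate(zip(*isolates)):
        amino_acid, count = Counter(column).most_common(1)[0]
        if count / len(isolates) >= min_fraction:
            conserved[first_residue + offset] = amino_acid
    return conserved


# Hypothetical variable segments (residues 11-15) from five isolates.
isolates = ["IRTVQ", "MRTFQ", "IRYVD", "MRTVQ", "IRTWQ"]

print(conserved_positions(isolates, first_residue=11))
```

As noted above, any conclusion drawn from such a table is tentative,
because the isolates need not all bind in the same manner.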

In a second round of Diffuse Mutagenesis, we vary the
codons shown in Table 206. Residues 1 through 10 are not
varied because these provide the best match for the first
ten bases of the target. Residues 19, 20, and 21 are not
varied so that the synthetic oligonucleotides can be
annealed. The two-way variations at residues 11 through 18
and 23 through 36 all allow the selected amino acid to be
present, but also allow an as-yet-untested amino acid to
appear. It is desirable to introduce as much variegation
as the genetic engineering and selection methods can
tolerate without risk that the parental DBP sequence will
fall below detectable level. Having picked three residues
for the homologous overlap, we have only 23 amino acids to
vary. Thus residue 22 is varied through four
possibilities instead of only two. Residue 22 was chosen for
four-way variegation because it is next to the unvaried
residues. We use pEP2004 as the backbone, and ligate DNA
prepared with Klenow fragment from oligonucleotides #423
and #424 (Table 206) to the BstE II and Bsu36 I sites. The
resulting population of plasmids containing the variegated
DNA is denoted vg2-pEP2004.
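
The arithmetic behind the four-way variation of residue 22 can be
checked with a short Python sketch. The function name is hypothetical;
the residue numbers are those given above. With 22 residues varied two
ways and one residue varied four ways, the second-round population has
the same size as the first-round population of 24 two-way codons.

```python
from math import prod


def library_size(alternatives: dict) -> int:
    """Number of distinct protein sequences, given choices per residue."""
    return prod(alternatives.values())


# Second round: residues 11-18 and 23-36 are varied two ways, residue 22
# four ways; residues 19-21 are fixed for the annealing overlap.
round_two = {r: 2 for r in list(range(11, 19)) + list(range(23, 37))}
round_two[22] = 4

# First round for comparison: twenty-four codons with two choices each
# (the keys here are placeholders, not residue numbers).
round_one = {index: 2 for index in range(24)}

print(library_size(round_one), library_size(round_two))
```
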

Table 207 shows the amino acid sequence obtained from
a hypothetical isolate bearing a DBP gene specifying a
polypeptide with improved affinity for the target. Changes
in amino acid sequence are observed at ten positions.
Comparisons of the sequences from several such isolates as
well as those obtained in the first round of mutagenesis
can be used to locate residues providing significant
DNA-binding energy.

Having established some affinity for the target, we
now seek to optimize binding via a more focused mutagenesis
procedure. Table 208 shows a third variegation in which
twelve residues in the variable region are varied through
four amino acids in such a way that the previously selected
amino acids may occur. Again, pEP2004 is used as backbone
and synthetic DNA having cohesive ends is prepared from
olig#325 and olig#327. The plasmid is denoted vg3-pEP2004.
In subsequent variegation, we would vary other residues
through four amino acids at one time. By hypothesis, we
select the polypeptide shown in Table 209 that has high
specific affinity for the first target; now we can:

a) replace both occurrences of the first target by a
second target, i.e. the intact HIV-1 subsequence
(1024-1044), and

b) use the selected polypeptide as the parental DBP to
generate a variegated population of polypeptides from
which we select one or more that bind to the second
target.

Because the second target differs from the first in the
region thought to be bound by residues 1 through 10 of the
parental DBP, we concentrate our variegation within these
residues for the first several rounds of variegation and
selection.

We replace the target DNA sequence in the neo promoter
for tet in pEP2002 with ds DNA comprising synthetic
olig#450 and olig#452 at the StuI/HindIII site. The new
plasmid is named pEP2010 and confers Tc^R on delta4 cells.

Second Target and Pneo that promotes tet

           5' |cct|gcg|aac|cgg|aat|tgc|cag|-
Olig#450 = 3'  gga cgc ttg gcc tta acg gtc-
              | StuI |           |  -35 |

              |CTG|GGG|CGC|CCT|CTG|GTA|AGG|TTG|GG-
               gac ccc gcg gga gac cat tcc aac cc-
                                    | -10  |

                ATA ATA CAG TAg caa ccc tct   = HIV 1024-1044
              A|ATa|ATA|cAg|tag|caa|ccc|tct|A 3' = Olig#452
              t taT tat GtC ATC GTT GGG AGA t tcg a 5'
               |------ Second Target ------| |Hind3|



We replace the target in the amp promoter for galT,K
of pEP2010 with synthetic olig#454 and olig#456 between
Apa I and Xba I sites. The new plasmid is named pEP2011 and
confers Gal+ on HB101. pEP2011 contains the second target
in both selectable genes and is ready for introduction of a
variegated pdbp gene and selection of cells expressing
polypeptides that can selectively bind the target DNA
subsequence.

Second Target and Pamp that promotes galT,K

           5'       |ctt|cta|aat|aca|ttc|aaa|
Olig#454 = 3' c cgg  gaa gat tta tgt|aag ttt|
             | ApaI |               |  -35  |

                    |TAT|GTA|TCC|GCT|CAT|GAG|ACA|ATA|ACC|CT
                     ata cat agg cga gta ctc tgt tat tgg ga

                      ATA ATA CAG TAg caa ccc tct   = HIV 1024-1044
                    T|ATa|ATA|cAg|tag|caa|ccc|tct|CGT 3' = Olig#456
                    a taT tat GtC ATc GTT GGG AGA gca gat c 5'
                     |------ Second Target ------|   | XbaI |

Variegation of the first eleven residues of the
potential DNA-binding polypeptide is illustrated in Table
210. Double-stranded DNA having appropriate cohesive ends
is prepared from olig#460 and olig#461, Klenow fragment,
BstE II, and Bsu36 I. This DNA is ligated into similarly
digested backbone DNA from pEP2011; the resulting plasmid
is denoted vg1-pEP2011. Delta4 cells are transformed and
selected with fusaric acid and galactose. Table 211 shows
the sequence of a 37 amino-acid polypeptide isolated from
cells exhibiting the DBP+ phenotypes by the above
hypothetical selection. The sequence shown in Table 211 is
hypothetical and is given by way of example. This example
must not be construed as a prediction that this sequence
has specific affinity for the target or any other DNA
sequence. Further variegation (vg2, vg3, ...) of this
peptide and selection for binding to Target#2 will be
needed to obtain a peptide of high specificity and affinity
for Target#2.

We anticipate that successful DBP production will take
more than three or four cycles of variegation and
selection, perhaps 10 or 15. We anticipate that initial phases
will require careful adjustment of the selective agents and
IPTG because the level of repression afforded by the best
polypeptide may be quite low. As stated, we expect that
biophysical methods, such as X-ray diffraction or NMR,
applied to complexes of DNA and polypeptide will yield
important indications of how to hasten the forced
evolution.

The length of the polypeptide in the example may not
be optimal; longer or shorter polypeptides may be needed.
It may be necessary to bias the amino acid composition more
toward basic amino acids in initial phases to obtain some
non-specific DNA binding. Inclusion of numerous aromatic
amino acids (W,F,Y,H) may be helpful or necessary.

Other strategies to obtain polypeptides that bind
sequence-specifically are illustrated in Examples 3, 4,
and 5.

Example 3

We present a second example of the application of our
selection method applied to the generation of asymmetric
DBPs. A possible problem with making and using DNA-binding
polypeptides is that the polypeptides may be degraded in
the cell before they can bind to DNA. That polypeptides
can bind to DNA is evident from the information on
sequence-specific binding of oligopeptides such as Hoechst
33258. Polypeptides composed of the 20 common natural
amino acids contain all the needed groups to bind DNA
sequence-specifically. These are obtained by an efficient
method to sort out the sequences that bind to the chosen
target from the ones that do not. To overcome the tendency
of the cells to degrade polypeptides, we will attach a
domain of protein to the variegated polypeptide as a
custodian. The first example of a custodial domain
presented is residues 20-83 of barley chymotrypsin
inhibitor.

The strategy is to fuse a polypeptide sequence to a
stable protein, assuming that the polypeptide will fold up
on the stable domain and be relatively more protected from
proteases than the free polypeptide would be. If the
domain is stable enough, then the polypeptide tail will
form a make-shift structure on the surface of the stable
domain, but when the DNA is present, the polypeptide tail
will quickly (within a few milliseconds) abandon its former
protector and bind the DNA. The barley chymotrypsin
inhibitor (BCI-2) is chosen because it is a very stable
domain that does not depend on disulfide bonds for
stability. We could attach the variegated tail at either end of
BCI-2. A preferred order of amino acid residues in the
chimeric polypeptide is: a) methionine to initiate
translation, b) BCI-2 residues 20-83, c) a two residue linker, d)
the first ten residues of Arc, and e) twenty-four residues
that are varied over two amino acids at each residue. The
linker consists of G-K. Glycine is chosen to impart
flexibility. Lysine is included to provide the potentially
important free amino group formerly available at the amino
terminus of the Arc protein. The first target is the same
as the first target of Example 2.

Table 300 shows the sequence of a gene encoding the
required sequence. The ambiguity of the genetic code has
been resolved to create restriction sites for enzymes that
do not cut pEP1009 outside the rav gene. This gene could
be synthesized in several ways, including the method
illustrated in Table 301 involving ligation of
oligonucleotides 470-479. Plasmid pEP3000 is derived from pEP2004 by
replacement of the arc gene with the sequence shown in
Table 300 by any appropriate method.

Table 302 illustrates variegated olig#480 and olig#481
that are annealed and introduced into the CI2-arc(1-10)
gene between PpuM I and Kpn I to produce the plasmid
population vg1-pEP3000. Cells transformed with vg1-pEP3000 are
selected with fusaric acid and galactose in the presence of
IPTG. Further variegation (vg2, vg3, ...) will be required
to obtain a polypeptide sequence having acceptably high
specificity and affinity for Target#1.

Example 4

We present a second strategy involving a polypeptide
chain attached to a custodial domain. In this strategy,
the custodial domain contains a DNA-recognizing element
that will be exploited to obtain quicker convergence of the
forced evolution.

The three alpha helices of Cro fold on each other.
It has not been observed that these helices fold by
themselves, but no efforts in this direction have been
reported. We will attach a variegated segment of 24 residues to
residue 35 of Cro (H35 is the last residue of alpha 3).
The target will be picked to contain a good approximation
to the half OR3 site at one end but no constraint is placed
on the bases corresponding to the dyad-related other half
of OR3. A sequence that departs widely from the OR3
sequence is actually preferred, because this discourages
selection of a novel dimeric molecule. We assume that
alpha-3 forms and binds to the same four or five bases that
it binds in OR3 and that a polypeptide segment attached to
the carboxy terminus of alpha-3 can continue along the
major groove. We attach 24 amino acids of polypeptide
immediately after the last residue of alpha-3, wherein the
polypeptide is chosen: a) to have more positive charge than
negative charge, b) to have beta chain predominate, c) to
have some aromatic groups, and d) to have some H-bonding
groups. This produces a population that is then cloned, and host
cells are selected for expression of a polypeptide that
binds preferentially to the target sequence.

We first construct a hybrid target sequence (Target
#3) containing one OR3 half-site fused to a portion of the
final target. This hybrid target DNA subsequence is
inserted into the selectable genes in the same manner as
the arc operator was inserted in Example 2. We then follow
the same procedure to vary the 24 residues; first we vary
twenty-four residues, using two possible amino acids at
each residue. We carry out two or more cycles of such
diffuse variegation. Then we vary 12 residues, using 4
possible amino acids at each residue. We do two or more
iterations of this process so that all residues are varied
at least once.

We have now generated one or more DBPs that bind well
to one half of the final target sequence. Next we generate
binding to the other half of the final target. First we
replace both instances of Target #3 with the final target
sequence, Target #4. We then vary the alpha helix 3 and
the surface of the hypothesized domain formed by helices
1-3 to optimize binding to the final target sequence.

A search of the non-variable regions of the HIV-1
genome reveals that bases 624-640 (aATctcTAGCAGTGGCG)
contain a good match to one half of OR3, as shown in Table
400. As first target of this example, we choose

TATCCCTAGCAGTGGCG,

denoted Target#3, that has one half of OR3 and nine bases
from HIV-1. Once a sequence is obtained that binds
Target#3, we replace Target#3 by Target#4 = HIV 624-640 and
variegate the recognition helices taken from Cro.

To engineer Target #3 into Pneo that regulates tet,
plasmid pEP2002 is digested with StuI and HindIII and the
purified backbone is ligated to an annealed, equimolar
mixture of olig#490 and olig#492. Delta4 cells are
transformed and selected with Tc; replacement of the arc
operator relieves the repression by Arc. Plasmid DNA from
Tc^R colonies is sequenced to confirm the construction; the
construction is called pEP4000.

Target #3 and Pneo that promotes tet

           5' |CCT|GCG|AAC|CGG|AAT|TGC|CAG|-
Olig#490 = 3'  gga cgc ttg gcc tta acg gtc-
              | StuI |           |  -35 |

              |CTG|GGG|CGC|CCT|CTG|GTA|AGG|TTG|GG-
               gac ccc gcg gga gac cat tcc aac cc-

                aAT ctc TAG CAG TGG CG    = HIV 624-640
              A|TAT|CCC|TAG|CAG|TGG|CGA 3' = Olig#492
              t ata ggg atc gtc acc gct tcg a 5'
               |----- Target #3 ------| |Hind3|

We engineer the second instance of the target, in
like manner, into Pamp for galT,K, using Apa I and Xba I to
digest pEP4000 and olig#494 and olig#496. HB101 cells
(galK-) are transformed and are selected for ability to
grow on galactose as sole carbon source. Plasmid DNA from
Gal+ colonies is sequenced in the region of the insert to
confirm the construction. The plasmid is called pEP4001.



Target #3 and Pamp that promotes galT,K

           5'       |ctt|cta|aat|aca|ttc|aaa|
Olig#494 = 3' c cgg  gaa gat tta tgt|aag ttt|
             | ApaI |               |  -35  |

                    |tat|gta|tcc|gct|cat|gag|aca|ata|acc|
                     ata cat agg cga gta ctc tgt tat tgg
                                           | -10  |

                    |CTT|TAT|CCC|TAG|CAG|TGG|CG CGT 3' = Olig#496
                     gaa ata ggg atc gtc acc gc gca gat c 5'
                        |----- Target #3 ------|   | XbaI |

A gene fragment encoding the first two helices of Cro
is shown in Table 401. Olig#483 and olig#484 are
synthesized and extended in the standard manner and the DNA is
digested with BstE II and Kpn I. This DNA is ligated to
backbone from pEP4001 that has been digested with BstE II
and Kpn I; the resulting plasmid, denoted pEP4002, contains
the Target #3 subsequence in both selectable genes and is
ready for introduction of a variegated pdbp gene between
Bgl II and Kpn I. Table 402 shows a piece of vgDNA prepared
to be inserted into the Bgl II-Kpn I sites of pEP4002. Table
403 shows the result of a selection of delta4 cells,
transformed with vg1-pEP4002, with fusaric acid and
galactose in the presence of IPTG. Additional cycles of
variegation of residues 36-61 are carried out in such a way
that the amino acid selected at the previous cycle is
included. After several cycles in which 22-24 residues are
varied through two possible amino acids, we choose 10-13
amino acids and vary them through four possibilities.

Once reasonably strong binding to Target#3 is
obtained, we replace Target#3 with Target#4 and vary the
residues in helix 3 (residues 26-35) and, to a lesser
extent, helix 2 (residues 16-23).

Example 5

We disclose here a method of engineering a polypeptide
extension onto the amino terminus of P22 Arc, a natural
DBP, so that the novel DBP develops asymmetric DNA-binding
specificity for a subsequence found in the HIV-1 genome.
Others have observed that loss of arms from natural DBPs
may cause loss of binding specificity and affinity (PABO82a
and ELIA85), but none, to our knowledge, have suggested
adding arms to natural DBPs in order to enhance or alter
specificity or affinity. The new construction is denoted a
"polypeptide extension DBP"; the gene is denoted ped and
the proteins are denoted Ped. Wild-type Arc forms dimers
and binds to a partially palindromic operator. We will
generate a sequence of DBPs descendent from Arc. Early
members of this family will form dimers, but will have
sufficient binding area such that asymmetric targets will
be bound. In final stages of the development, proteins
that do not dimerize will be engineered.

Table 200 shows the symmetric consensus of left and
right halves of the P22 arc operator, arcO. Table 500a
shows a schematic representation of the model for binding
of Arc to arcO that is supported by genetic and biochemical
data (VERS87b). Arc is thought to bind B-DNA in such a way
that residues 1-10 are extended and the amino terminus of
each monomer contacts the outer bases of the 21 bp operator
(RT Sauer, public talk at MIT, 15 September 1987).

Arc is preferred because: a) one end of the
polypeptide chain is thought to contact the DNA at the exterior
edge of the operator, and b) Arc is quite small so that
genetic engineering is facilitated. P22 Mnt is also a good
candidate for this strategy because it is thought that the
amino terminal six residues contact the mnt operator, mntO,
in substantially the same manner as Arc contacts arcO. Mnt
has significant (40%) sequence similarity to Arc (VERS87a).
Mnt forms tetramers in solution and it is thought that the
tetramers bind DNA while other forms do not. When the mnt
gene is progressively deleted from the 3' end to encode
truncated proteins, it is observed that proteins lacking
K79 and subsequent residues have lowered affinity for mntO
and that proteins lacking Y78 and subsequent residues can
not form tetramers and do not bind DNA
sequence-specifically (KNIG88). Some truncated Mnt proteins of 77 or fewer
residues form dimers, but these dimers do not present the
DNA-recognizing elements in such a way that DNA can be
bound. Arc is preferred over Mnt because Arc is smaller
and because Arc acts as a dimer.

Other natural DBPs that have DNA-recognizing segments
thought to interact with DNA in an extended conformation
(referred to as arms or tails) and thought to contact the
central part of the operator, such as lambda Cro or lambda cI
repressor, are less useful. For these proteins to be
lengthened enough to contact DNA outside the original
operator, several residues would be needed to span the
space between the central bases contacted by the existing
terminal residues and the exterior edge of the operator.

Table 500a illustrates interaction of Arc dimers with
arcO; the two "C"s of Arc represent the place, near residue
F10, at which the polypeptide chain ceases to make direct
contact with the DNA and folds back on itself to form a
globular domain, as shown in Table 500b and Table 500c.
Which of these alternative possibilities actually occurs
has not been reported. Our strategy is compatible, with
some alterations, with either structure. In Table 500b,
each set of residues 1-10 makes contact with a domain
composed of residues 11-57 of the same polypeptide chain;
the dimer contacts are near the carboxy terminus. Table
500c shows an alternative interaction in which residues
1-10 of one polypeptide chain interact with residues 11-57
of the other polypeptide chain; the dimer contacts occur
shortly after residue 10. The similarity of sequences of
Arc and Mnt, the demonstration of function of
DNA-recognizing segments transferred from Arc to Mnt (RT Sauer, public
talk at MIT, 15 September 1987 and Knight and Sauer cited
in VERS86b), and the behavior of Mnt on truncation suggest
that Table 500b is the correct general structure for Arc,
but the structure diagrammed in Table 500c is also
possible.

Table 501 shows the four sites at which one of the
consensus arc half operators comes within one base of
matching ten bases (six unambiguous and four having
two-fold ambiguity) in the non-variable segments of HIV-1 DNA
sequence, as listed in Example 1. The symbol "@" marks
base pairs that vary among different strains of HIV-1.
Because we intend to extend Arc from its amino terminus, we
seek subsequences of HIV-1 that: a) match one of the arc
half operators, and b) have non-variable sequences located
so that an amino-terminal extension of the Arc protein will
interact with non-variable DNA. The subsequences 1024-1033
and 4676-4685 meet this requirement while the subsequences
at 1040-1049 and 2387-2396 do not. In the case of
1040-1049, the amino-terminal extension would proceed in the 3'
direction of the strand shown and would reach variable DNA
after two base pairs. For 2387-2396, variable sequence is
reached at once. The subsequence 1024-1033 is preferred
over the subsequence 4676-4685 because it is much closer to
the beginning of transcription of HIV so that binding of a
protein at this site will have a much greater effect on
transcription. In the remainder of this example, positions
within the target DNA sequence will be given the number of
the corresponding base in HIV-1. Base A1034 of HIV-1 is
aligned with the central base of arcO.

HIV 1024-1044 has only three bases in each half that
are palindromically related to bases in the other half by
rotation about base pair 1034: A1024/T1044, the pair at
1026/1042, and G1032/C1036. The latter two base pairs
correspond to positions in arcO that are not palindromically
related. Five of the six palindromically related bases of
arcO correspond to non-palindromically related bases in HIV
1024-1044. Thus no dimeric protein derived from Arc is
likely to bind HIV 1016-1046 if symmetric changes are made
only in the residues 1-10 (or in any other set of residues
originally found in Arc). Our strategy is to add, in
stages, eleven variegated residues at the amino terminus
and to select for specific binding to a progression of
targets, the final target of the progression being bases
1016-1037 of HIV-1. Because the region of protein-DNA
interaction is increased beyond that inferred for wild-type
Arc-arcO complexes, unfavorable contacts in bases aligned
with the right half of arcO can be compensated by favorable
contacts of the polypeptide extension with bases 1016-1023.
The penultimate selection isolates a dimeric protein that
binds to the HIV-1 target 1016-1037; the ultimate selection
isolates a protein that does not dimerize and binds to the
same target.
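
The notion of bases that are "palindromically related by rotation
about a central base pair" can be made concrete with a short sketch.
It assumes a single strand written 5' to 3' and a central position
that is itself a base (odd-length site, as for the 21 bp arcO region).
The function name, the numbering argument, and the demonstration
sequence are hypothetical; the sequence is NOT the HIV-1 or arcO
sequence.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def palindromic_pairs(seq, center, first_position=0):
    """Return numbered (left, right) position pairs that are related by
    two-fold rotation about `center` (a 0-based index into `seq`).

    Positions center-k and center+k are palindromically related when the
    base at center-k is the Watson-Crick complement of the base at center+k.
    `first_position` is the number assigned to seq[0] in the output.
    """
    pairs = []
    k = 1
    while center - k >= 0 and center + k < len(seq):
        if COMPLEMENT[seq[center - k]] == seq[center + k]:
            pairs.append((first_position + center - k,
                          first_position + center + k))
        k += 1
    return pairs


# Hypothetical 21-base sequence, numbered so that its central base is 1034.
demo = "ATGCAAGCTTCACGGTTCGAT"
print(palindromic_pairs(demo, center=10, first_position=1024))
```

Applied to a real target, the returned pairs are the positions at
which a symmetric dimer could make equivalent contacts in both
half-sites.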

Table 502 shows a progression of target sequences
that leads from wild-type arcO to HIV 1016-1037. It is
emphasized that finding a subsequence of HIV-1 that has
high similarity to one half of arcO is not necessary;
rather, use of this similarity reduces the number of steps
needed to change a sequence that is highly similar to arcO
into one that is highly similar or identical to an HIV-1
subsequence. Reducing the number of steps is useful
because, for each change in target, we must: a) construct
plasmids bearing selectable genes that include the target
sequence in the promoter region, b) construct a variegated
population of ped genes, and c) select cells transformed
with plasmids carrying the variegated population of ped
genes for DBP+ phenotype.

In sections (a), (c), (e), and (g) of Table 502,
bases in the targets are in upper case if they match HIV
1016-1046 and are underscored if they match the wild-type
arcO sequence.

We construct a series of plasmids, each plasmid
containing one of the target sequences in the promoter
region of each of the selectable genes. For each target,
we variegate the ped gene and select cells for phenotypes
dependent on functional DBPs. For each target, several
rounds of variegation and selection may be required. We
anticipate that a plurality of proteins will be obtained
from independent isolates by selection for binding to one
target. We pick the protein that shows the strongest in
vitro binding to short DNA segments containing the target
as the parental Ped for the next round of variegation and
selection. Genetic methods, such as generation of point
mutations in the ped gene or in the target and selection
for function or non-function of Ped, can be used to determine
associations between particular bases and particular
residues (VERS86b).

Once a Ped with specific binding for the target is
obtained, it may be useful to determine a 3D structure of
the Ped-DNA complex by X-ray diffraction or other suitable
means. Such a structure would provide great help in
choosing residues to vary to improve binding to a given
target or to an altered target.

We initiate development of a polypeptide extension
DBP having affinity for HIV 1016-1037 by generating a
variegated population of Peds and selecting for binding to
the first target. Table 502a shows the first target, which
we designed to have identity to arcO in the left half, but
to have a mismatch (arcO vs. target) at A1038 (which is C
in the corresponding position in the right half of arcO and
is palindromically related to a G in the left half); the
rationale is as follows. Vershon et al. (VERS87b) report
that chemical modification with dimethyl sulfate of the
wild-type CG at this location interferes mildly with
binding of Arc and that this location is strongly protected
from modification by dimethyl sulfate if Arc is bound to the
operator. Thus we expect a mismatch between wild-type arcO
and the first target at A1038 to make wild-type Arc bind
poorly. Binding can be restored, however, by favorable
contacts to bases 1021-1023 by the amino-terminal extension.

An alternative first target would have C1038, as does
arcO at the corresponding location, and a base at 1041 unlike
that of arcO or HIV-1. Vershon et al. (VERS87b) report that
methylation of the corresponding CG base pair strongly
interferes with binding of Arc. Thus, changing the base that
corresponds to HIV 1041 should have a strong effect on
binding of Arc to the alternative target.

In the first variegation step, we extend Arc by five
variegated residues at the amino terminus. Since five
residues can contact no more than three bases in a
sequence-specific manner, we limit the extent of the target
to those bases that correspond to HIV 1021-1044. Inclusion
of bases corresponding to HIV 1016-1020 at this initial
stage might position the target too far downstream from the
promoters of the selectable genes to allow strong repression
of these promoters. Once a Ped displaying binding to
bases corresponding to 1021-1044 has been isolated, we can
introduce a greater length of the HIV-1 sequence into the
left side of the target without concern that the Ped will
bind too far downstream from the promoter of the selectable
genes to block transcription. Furthermore, once binding by
the amino-terminal extension has been established, we can,
in a stepwise manner, remove the right half of arcO from
the target, thereby forcing more asymmetric binding to the
left half of arcO and the bases upstream of 1024.

The first target is engineered into both selectable
genes as in Example 2. We use olig#501 and olig#502, shown
in Table 503, to introduce the first target downstream of
Pneo that promotes tet, replacing arcO in pEP2002; the
resulting plasmid is called pEP5000. From pEP5000, we use
olig#503 and olig#504 to construct pEP5010 in which the
first target replaces arcO downstream of Pamp that promotes
galT,K.

Table 502b shows schematically how the amino-terminal
residues align to the first target; the five-residue
extension is unlikely to contact more than 3 base pairs
upstream from base 1024. The alteration in the right half
operator prevents tight binding unless the additional
residues make favorable interactions upstream of 1024.
Care is taken in designing the two instances of the target
that the downstream boundaries are different: AAG in Pneo
and CGT in Pamp. Thus, for the novel DBP to bind specifically
to both instances of the target, it must recognize
the common sequence upstream of base 1024.

An initial variegated ped is constructed using
olig#605, as shown in Table 504, and comprises: a) a
methionine codon to initiate translation, b) five variegated
codons that each allow all twenty possible amino
acids, and c) the Arc sequence from 101 to 157. (Because
we are constructing a polypeptide extension at the amino
terminus, we have added 100 to the residue numbers within
Arc so that Arc residue 1 is designated 101.) This variegated
segment of DNA comprises (2^5)^5 = 2^25 = approx. 3.4 x 10^7
different DNA sequences and encodes 20^5 = 3.2 x 10^6
different protein sequences; with the given technical
capabilities, we can detect each of the possible protein
sequences. The 3' terminal 20 bases of olig#605 are
palindromically related so that each synthetic oligonucleotide
primes itself for extension with Klenow enzyme.
The DNA is then digested with Bsu 36I and BstE II and is
ligated to the backbone of appropriately digested pEP5010,
which bears the first target in each selectable gene.
Transformed delta4 cells are selected for FusR GalR at low,
medium, and high concentrations of IPTG, the inducer of the
lacUV5 promoter that regulates ped. Because the first
target is quite similar to arcO, we anticipate that a
functional Ped will be isolated with low-level induction of
the ped gene with IPTG.
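
The library sizes quoted for the first variegation follow from simple
counting. The sketch below assumes, as the factor (2^5) suggests, that
each variegated codon is drawn from 32 codons (third base restricted to
two nucleotides) that together encode all twenty amino acids; the
function name and parameter names are illustrative only.

```python
def library_sizes(n_codons, codons_per_position=32, aas_per_position=20):
    """Return (number of DNA sequences, number of protein sequences)
    for n_codons independently variegated codons."""
    dna = codons_per_position ** n_codons
    protein = aas_per_position ** n_codons
    return dna, protein


dna, protein = library_sizes(5)
print(f"five variegated codons: {dna:.2e} DNA sequences, "
      f"{protein:.2e} protein sequences")
```

For five codons this gives 2^25 = 33,554,432 DNA sequences and
20^5 = 3,200,000 protein sequences.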

More than one round of variegation and selection may
be required to obtain a Ped with sufficient affinity and
specificity for the first target. Function of a Ped is
judged in comparison to the protection afforded by wild-type
Arc in cells bearing pEP2002. Specifically, strength
of Ped binding is measured by the IPTG concentration at
which 50% of cells survive selection with a constant
concentration of galactose or fusaric acid, chosen as a
standard for this purpose. A Ped is deemed acceptable if
it can protect cells against the standard concentrations of
galactose and fusaric acid, administered in separate
tests, with an IPTG concentration of 5 x 10^-[exponent
illegible] M. Preferably, a Ped can protect cells against
the standard concentrations of galactose and fusaric acid,
tested separately, with no more than ten times the
concentration of IPTG needed by pEP2002-bearing cells.
Variegation of residues 101, 102, and others may be needed.
We anticipate that a plurality of independent functional
Peds will be isolated; we discriminate among these by
measuring in vitro binding to DNA oligonucleotides that
contain the target sequence. The amino-acid sequences of
different isolates are compared; residues that always
contain only one or a few kinds of amino acids are likely
to be involved in sequence-specific DNA binding. Table 505
shows a hypothetical isolate, Ped-6, that binds the first
target.
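
The comparison of independent isolates can be sketched as a count of
the distinct amino acids seen at each aligned position. This is only
an illustration of the idea; the function names, the threshold
`max_kinds`, and the four five-residue extensions are hypothetical and
are not taken from Table 505.

```python
def residue_diversity(isolates):
    """Number of distinct amino acids at each position of aligned,
    equal-length sequences."""
    length = len(isolates[0])
    if any(len(seq) != length for seq in isolates):
        raise ValueError("sequences must be aligned and of equal length")
    return [len(set(column)) for column in zip(*isolates)]


def conserved_positions(isolates, max_kinds=2):
    """0-based positions showing at most `max_kinds` kinds of amino acid;
    these are candidates for sequence-specific DNA contacts."""
    return [i for i, n in enumerate(residue_diversity(isolates))
            if n <= max_kinds]


# Hypothetical amino-terminal extensions from four independent isolates.
extensions = ["RKSTA", "RQSLA", "RKSGA", "RKSYV"]
print(residue_diversity(extensions))
print(conserved_positions(extensions, max_kinds=1))
```

Positions that tolerate many residues are less likely to contribute
to specificity and are weaker candidates for further variegation.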

Table 502c shows the changes between the first target
and the second target. Three changes are made left of
center to make the target more like HIV 1016-1042. Only
the change G1030->C affects a base that is palindromically
related in arcO. One change is made right of center that
makes the target more like HIV 1016-1042, less like arcO,
and less palindromically symmetric. Furthermore, the
target is shortened on the right by two bases so that
selection isolates proteins that bind asymmetrically to the
left side of the target. Starting with pEP2002, we
introduce, in two genetic engineering steps that use
olig#541, olig#542, olig#543 and olig#544 (Table 506), the
second target (in place of arcO) into the promoter region
of each selectable gene; the resulting plasmid is denoted
pEP5020.

Table 507 shows a variegated sequence that is ligated
into pEP5020 between BstE II and Bsu 36I. Variegated codons
are shown in the same way as in Table 204.

Table 502d illustrates that residues 100-110 of Ped-6
contact the bases of the second target that differ from the
first target. Accordingly, residues 1 and 96-99 of Ped are
not variegated in the DNA shown in Table 507; rather,
residues 100-110 are each varied through four possibilities,
always including the amino acid previously present
at that residue. This generates 4^11 = 2^22 = approx. 4 x
10^6 different DNA and protein sequences. Selection of
transformed delta4 cells for FusR GalR and screening by in
vitro DNA binding yields, by hypothesis, a plasmid coding
on expression for the protein Ped-6-2, illustrated in
Table 508.

An alternative to the variegation shown in Table 507
is one in which we vary residues 101-105, 108, and 110
through eight possibilities each, yielding 8^7 = approx.
2.1 x 10^6 DNA and protein sequences. These residues, except
M101, are indicated to be in contact with the operator. M101
has been altered by the attachment of the polypeptide
extension and thus should be altered. After variegation of
the listed residues and selection, further variegation should
include some variegation of residues 96-103 because changes
in the listed residues may change the context within which
residues 96-103 contact the DNA.

More than one round of variegation and selection may
be required to obtain a Ped having sufficient affinity and
specificity for the second target.

Table 502e shows the changes from the second target to
the third, which comprise: a) inclusion of bases 1018-1020,
b) one change to the left of the 21 bp arcO region, c) two
changes at the center of the arcO region, d) two changes
left of center, and e) removal of bases 1041 and 1042. All
of these changes make the third target less symmetric and
more like HIV 1016-1040. The third target is introduced
into each of the selectable genes in the same manner as the
second target. The resulting plasmid, obtained in two
genetic engineering steps, is denoted pEP5030. Table 502f
shows that residues 96-110 are all potential sites to
alter the specificity and affinity of DBPs derived from
Ped-6-2. Thus, in Table 510, we illustrate a segment of
variegated DNA that comprises 2^20 = approx. 10^6 DNA
sequences, encoding on expression approx. 10^6 protein
sequences having ten residues varied through two
possibilities and five residues through four possibilities.
The DNA is then digested with BstE II and Bsu 36I and
ligated into pEP5030. Transformed delta4 cells are selected
for FusR GalR. By hypothesis, we isolate a plasmid, denoted
pEP5031, that codes on expression for the protein Ped-6-2-5
shown in Table 509.

Table 502g shows the changes between the third and
fourth targets. The changes are: a) inclusion of bases
1016-1017, b) two changes right of center, and c) removal
of bases 1038-1040. The initial variegation to be selected
using the fourth target consists of an extension of six
residues at the amino terminus of Ped-6-2-5, shown in Table
511. In iterative steps of forced evolution of proteins,
one should not produce a number of different DNA sequences
greater than the number of independent transformants that
one can obtain (about 10^[exponent illegible] with current
technology). Because there are no residues corresponding to
90-95 in the parental DBP (Ped-6-2-5), the first variegation
and selection with the fourth target is a non-iterative step
and it is permissible to produce 10^9 DNA sequences and 6.4
x 10^7 protein sequences. In subsequent iterative rounds of
variegation, the number of variants is, preferably,
limited to a fraction, e.g. 10%, of the number of independent
transformants that can be generated and subjected to
selection. A protein, illustrated in Table 512 and denoted
Ped-6-2-5-2, is isolated, by hypothesis, through selection
of a variegated population of transformed cells for FusR
GalR.
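
The sizes of the mixed variegation schemes used for the second, third,
and fourth targets, and the rule of thumb that an iterative library
should stay within a fraction of the obtainable transformants, can be
checked with a short calculation. This is a sketch under stated
assumptions: the function names are illustrative, and the transformant
count passed in the usage lines (10**8) is a placeholder chosen for the
example, not the figure given in the text.

```python
def mixed_library_size(scheme):
    """Number of variants for a scheme given as (n_positions, n_choices)
    pairs, e.g. [(10, 2), (5, 4)] for ten two-way and five four-way
    positions."""
    size = 1
    for n_positions, n_choices in scheme:
        size *= n_choices ** n_positions
    return size


def fits_iterative_budget(n_variants, n_transformants, fraction=0.10):
    """True if the library is no larger than `fraction` of the number
    of independent transformants."""
    return n_variants <= fraction * n_transformants


second_target = mixed_library_size([(11, 4)])          # residues 100-110
alternative = mixed_library_size([(7, 8)])             # seven residues, 8-way
third_target = mixed_library_size([(10, 2), (5, 4)])   # Table 510 scheme
fourth_dna = mixed_library_size([(6, 32)])             # six 32-codon positions
fourth_protein = mixed_library_size([(6, 20)])

PLACEHOLDER_TRANSFORMANTS = 10**8  # illustrative value only
for name, n in [("second", second_target), ("third", third_target),
                ("fourth (DNA)", fourth_dna)]:
    print(name, n, fits_iterative_budget(n, PLACEHOLDER_TRANSFORMANTS))
```

The fourth-target DNA library (2^30, about 10^9) exceeds such a
budget, which is consistent with treating that step as non-iterative.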

Ped-6-2-5-2 binds specifically to HIV 1016-1037 as a
dimer. HIV 1016-1037 has no palindromic symmetry.
Binding to an asymmetric DNA sequence by a dimeric protein
is possible because the Ped-6-2-5-2 dimer has more recognition
elements than wild-type P22 Arc dimer and so can bind
even though nearly half of the right half of arcO has been
removed from the target. Ped-6-2-5-2 is useful as is;
nevertheless, obtaining a monomeric protein may have
advantages, including: a) higher affinity for the target
because suboptimal interactions are eliminated, and b)
lower molecular weight. Obtaining a functional monomeric
Ped is easiest if Arc dimers interact in the manner shown
in Table 500b. We use the following steps to isolate a
protein that binds specifically to HIV 1016-1037 as a
monomer.

Ped-6-2-5-2 is the parental DBP from which we derive
the monomeric DBP. The route taken from a palindromically
symmetric arcO sequence to an asymmetric HIV sequence was
designed to select for binding to the left half of the
original arc operator.

Proteins that do not dimerize, but that bind specifically
to the fourth target, can be generated in several
ways. Because the 3D structure of Arc is still unknown, we
cannot use Structure-Directed Mutagenesis to pick residues
to vary to eliminate dimerization. One way to obtain
monomeric proteins is to use diffuse mutagenesis to vary
all residues from 111 to 157 and select for proteins that
can bind the target sequence. Another strategy is to
synthesize the ped gene in such a way that numerous stop
codons are introduced. This causes a population of
progressively truncated proteins to be expressed. Table
513 shows a segment of variegated DNA that spans the BglII
to KpnI sites of the arc gene used throughout this example.
This segment is synthesized with suitable spacer sequences
on the 5' end. The extra "t" at the 3' end allows two such
chains to prime each other for extension with Klenow
enzyme. The ratios of bases in the variegated positions
are picked so that each varied codon causes about 35% of
polypeptides to terminate at that position. Since we
intend to determine how much the protein can be shortened
and remain functional, we begin by replacing codon 153 with
stop. Since 15 residues are varied, only about 0.3% of
chains will continue to stop codon 153 without one or more
stop codons. All the intermediate length chains will be
present in the selection in detectable amount. delta4
cells transformed with pEP5030 containing this vgDNA are
selected for FusR GalR. Because each variegated codon
causes translation termination in about 35% of the genes in
the variegated population, shorter coding regions are more
abundant than longer ones. Thus, the shortest gene that
encodes a functional repressor will be the most abundant
gene selected. Plasmid DNA from a number of independent
selected colonies is sequenced. The dimerization properties
of several functional DBPs are tested in vitro and the
sequence of the shortest monomeric protein is retained for
use and further study.
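
The length distribution produced by stop-codon variegation is
geometric, and a few lines suffice to tabulate it. The sketch assumes
every varied codon terminates translation independently with the same
probability; the function name is illustrative. With a stop probability
of exactly 0.35 at each of 15 codons the computed read-through is about
0.16%, the same order as the figure of about 0.3% quoted above (a
per-codon value nearer 0.32 would give 0.3%).

```python
def truncation_distribution(n_varied, p_stop):
    """Return (fractions, read_through).

    fractions[k] is the fraction of chains that terminate at the
    (k+1)-th varied codon; read_through is the fraction that passes
    all n_varied codons without encountering a stop.
    """
    fractions = [(1.0 - p_stop) ** k * p_stop for k in range(n_varied)]
    read_through = (1.0 - p_stop) ** n_varied
    return fractions, read_through


fractions, read_through = truncation_distribution(15, 0.35)
for k, f in enumerate(fractions, start=1):
    print(f"stop at varied codon {k:2d}: {100 * f:6.2f}%")
print(f"read-through to the final stop: {100 * read_through:.2f}%")
```

Because each successive length class is about 0.65 times as abundant
as the one before, the shortest functional gene dominates the selected
population, as stated in the text.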

In this manner, we generate a protein that binds
monomerically to a DNA sequence that has no palindromic
symmetry.

Example 6 

We illustrate here the fusion of two known DNA-binding
domains to form a novel DNA-binding protein that recognizes
an asymmetric target sequence. The progression of targets
is the same as shown in Table 502 (Example 5). The
amino-acid sequence of the initial DBP is illustrated in Table
600 and comprises the third zinc-finger domain from the
product of the Drosophila kr gene (ROSE86), a short linker,
and P22 Arc. The linker consists of three residues that
are picked to allow: a) some flexibility between the two
domains, and b) introduction of a KpnI site. The polypeptide
linker should not allow excessive flexibility because
this would reduce the specificity of the DBP.

The primary set of residues to vary to alter the
DNA binding are marked with asterisks. Those in the zinc
finger were picked by reference to the model of Gibson et
al. (GIBS88); all residues having outward-directed side
groups (except those directed upward from the beta strands)
were picked. Residues 101-110 (1-10 of Arc) were also
picked to be in the primary set. Other residues within the
Arc sequence may be varied. For each target in the
progression, we initially choose for variegation residues
in the primary set that are most likely to abut that part
of the target most recently changed. For example, for the
first target, we begin by varying residues 21, 24, 25, 28,
and 29, each through all twenty amino acids. After one or
more rounds of variegation and selection, other residues in
the primary and secondary set are varied.

Other zinc-finger domains, such as those tabulated by
Gibson et al. (GIBS88), are potential binding domains.
Other proteins with known DNA binding, such as 434 Cro, may
be used in place of Arc. Multiple zinc fingers could be
added, stepwise, to obtain higher levels of specificity and
affinity.

TABLES

Table 1 MISSENSE MUTATIONS IN LAMBDA CRO



10 



HEQRITLKDYAHRF 



/ \ / \ / V| alpha 1 I 
beta 111 I I 

I R D (L) 

S 



15 20 25 30 35 

• • • ♦ • 

GQTKTAKDLGVYQSAINKAIHAGR 

(K)R 

(F)C A N 
E P N L R Q (T) L 

R H D H N T (R)K) Q 

IJ . l-U I - LI I 

~| alph a 2 | • — — | alpha 3 f 

I I 1 I I I I 

P T HA L T T 

V F S V 

R T G 

P K 
Table 1, continued

40 45 50 55 60 65 

• • . • • • • 

KIFLTINADGSVYAEEVKPPPSNKKTTA 

G R NT 
Y A A Q V S 

T H(E)N (K)K N C L 

I III I I I I I 

/ \ / \ / \ / ~\ / \ / \ / \ / 

beta 2 beta 3 

II I I I 

F F(A) G T 

S V 
L ^ S 

M 

Notes:

Substitutions occurring at solvent exposed positions in the
unbound repressor dimer are shown above the wild type
sequence.

Substitutions occurring at internal positions are shown below
the wild type sequence.

Substitutions that produce repressor dimers with normal or
nearly normal DNA binding affinities are shown in
parentheses.

Table 2: Examples of selections for plasmid uptake and
maintenance in E. coli

                    (alternate
                    designation)    function

AmpR                                beta-lactamase
KanR                (KmR, neo)      aminoglycoside P-transferase
TetR                (TcR, tet)      membrane pump
CamR                (CmR, cat)      acetyltransferase
colicin immunity                    binds to colicin in vivo
TrpA+                               complementation of trpA


Table 3: Examples of selections for plasmid uptake and
maintenance in S. cerevisiae

gene      function

Ura3+     complements ura3 auxotroph
Trp1+     complements trp1 auxotroph
Leu2+     complements leu2 auxotroph
His3+     complements his3 auxotroph
NeoR      resistance to G418

Table 4 (continued): Agents for Selection
of DBP Binding in E. coli and Relevant Genotypes

Notes:

1) Deletions are strongly preferred over point mutations.

2) Only secA gene need be controlled by DBP.

3) Mutations in crp are highly pleiotropic; some
effects seen in cell wall. crp best used in connection
with selections having intracellular action.

4) Resistance to colicins can arise in several ways;
use of two or more E-colicins discriminates against
other mechanisms. Because colicins do not replicate,
they are preferred over phage for selection. Phage
are useful to verify selection of cells repressing
expression of ompA.

5) Because colicins do not replicate, they are
preferred over phage for selection. Phage are useful
to verify selection of cells repressing expression of
tsx.

Table 5: Some Recommended Pairs of Selectable
Binding Marker Genes

A) Recommended pairs:

galT,K   tetA                       |  galT,K   pheS
araP     pheS                       |  tetA     thyA
lacZ     tetA                       |  ptsM     thyA
dctA     cysK                       |  ompA     pyrF
crp      thyA                       |  btuB     pyrF
lamB     thyA                       |  tonA     galT,K
secA & malE-lacZ fusion    pyrF     |  cir      cysK
tsx      cysK                       |  aroP     lacZ
dctA     thyA

B) Less Preferred pairs:               Reason

tetA     araP                          Both transport related.
secA & malE-lacZ fusion    lacZ        Both related to lacZ function.
pyrF     thyA                          Both related to thymine
lamB     galT,K                        Both related to sugar metabolism.
cir      tsx                           Both related to colicin
ptsM     tetA                          Both transport related
tonA     ptsM                          Both transport related
crp      lacZ                          Both related to sugar metabolism

Table 6 Promoters

A: Correlation between Sequence Homology and Promoter
Strength (MULL84)

Promoter       Homology score    Log KB k2

T7 A1          74.0              7.40
T7 A2          73.4              7.20
lambda PR      58.6              7.13
lac UV5        59.2              6.94
               59.2              6.30
T7 D           63.9              6.30
               63.9              6.00
Tn10 Pout      56.2              6.71
Tn10 Pin       52.1              6.18
lambda PRM     49.7              4.71
               49.7              4.17

Pamp           52.7

Pneo           58.0

Table 6 (continued) Promoters
B: Sequences of some promoters

Name -35 , [ , [ -10 ±1. 

T7 Al GTATTGACTTAAAG TCTAACCTATAGS&2&SITAC AGCC^ 

T7 A2 GTATTGACAACATG AAGTAACATGCA GTAAGATA CA AATCfi 

\ GT GTTGACTA TTTT ACCTCTGGCGGTiaSMTGGT TGCA 

lac UV5 GG CTTTACA CTTTA TGCTTCCGGCTC ATATAAT GTG TGG^. 

T7 D GC GTTGACT TGATG GGTCTTTATGT GTAGGCTT TA GGTG 

TnlO Pont GG GCAGAATT G6TA AAGAGAGTCGTGTAAAATATC GAG2 

TnlO Pin AGGTGGATACACAT CTTGTCATATG ATCAAAT GGT TTCG , 

X Prm tg ttagatat ttat cccttgcggtg atagatt taa cata 

Pamp AC ATTCAAAT ATGT ATCCGCTCATG AGACAATA AC CCTG 

Pneo GA ATTGCCA GCTGG GGCGCCCTCTG GTAAGGT TGG GAAG 

Table 7 FUNCTIONAL SUBSTITUTIONS IN HELIX 5 OF LAMBDA
REPRESSOR

Table 8: Some Preferred Initial DBPs

lambda cI repressor
lambda Cro
434 cI repressor
434 Cro
P22 Mnt
P22 Arc
P22 cII repressor
lambda cII repressor
lambda Xis
lambda Int
cAMP Receptor Protein from E. coli
Trp Repressor from E. coli
Kr protein from Drosophila
Transcription Factor IIIA from Xenopus laevis
Lac Repressor from E. coli
Tet Repressor from Tn10
Mu repressor from phage mu
Yeast MAT-a1-alpha2
Polyoma Large T antigen
SV40 Large T antigen
Adenovirus E1A
Human Transcription Factor SP1 (a zinc finger protein)
Human Transcription Factor AP1 (product of jun)



Table 9, Table 10, and Table 11 have been deleted. 

Table 13. MISSENSE MUTATIONS IN P22 ARC REPRESSOR
THAT PRODUCE AN ARC- PHENOTYPE



5 






10 




15 








20 




25 




31 


• 

M K G M S K M 


P 


Q 


• 

F N LR 


W 


• 

P R 


£ 


V 


L 


• 

D L 


V R 


• 

K V A 


E E 


« 

N G 


Jilqh Yield 






■ • 
















' « 




• 


Q R I C 
T K 
medium vield 


L 




C 
V 




• 


















R 

low vield 












A 


G 
£ 




A F 




G 


A 




R 

undetermined 




L 


S K W Q 
F W 

• 


R 


S 
L 

• 








N 

y 

• 


H 
C 


I T 
V 

• 


K 


K 
T 
Y 

• 


L 












G 




S 


V 

* 










35 






40 




45 








50 




55 






• 

R S V N S E I 


Y 


Q 


R V M E 


S 


• 

F K 


K 


£ 


G 


• 


G A 


• 






hiah vield ■ 




























medium vield 






















Y+S 

• - 






A 

low vield 






A 




T 


















W F G H AM 
L K K 


S 
D 


P 


Q A G 




S 
C 




K 




P 

TABLE 14

MISSENSE MUTATIONS AT SOLVENT EXPOSED POSITIONS
OF THE H-T-H REGIONS OF REPRESSOR PROTEINS

Table 14a
Lambda Repressor



(S) 

L (N) 
Y(K)T R' C 

35 40 45 50 55 

• * • * • 

HELIX 2 TURN HELIX 3

QESVADKMGMGQSGVGALFNGINA 
* * * * * * 

SP EYL DD DDK. 

K L V 

L S 



Table 14b Lambda Cro

F K 
K R T 



15 




20 25 






30 




35 






• 


HELIX 


• • _ 

2 TURN 






* 

HELIX 


3 


• 






G Q 


T K T 


A K D L G V Y 


Q 


S 


A I N 


K A I 


H A 


G R 


K 


* 




* * 


* 


* 


* 


* 


* 


* 


* 


R H 




D 


H 


N 




N 




Q 


T 


E P 




N 


L 


R 




T 




L 





WO 90/07862    PCT/US90/00024

Table 14c 434 Repressor 

H 
L 
V 
T 
A 

20 25 30 35 

» • • • 

HELIX 2 TURN HELIX 3 

QAELAQKVGTTQQSIEQLEN 
* * * * 

A 
H 
L 
S 
M 
R 
P 
K 



Table 14d Trp Repressor 
T 

70 75 80 85 

• • • • 

HELIX 2 TURN HELIX 3 

QRELKNELGAGIATITRGSN 

* * * * 

S M C 

D H 

Table 14, NOTES:

Positions in wild type repressors believed to contact DNA are
indicated by a * below the wild type residue.
Substitutions that greatly decrease repressor binding to DNA
are shown below the wild type sequence.

Substitutions that produce repressors with normal or nearly
normal DNA binding affinities are shown above the wild type
sequence.

Substitutions that increase repressor affinity for DNA are
shown in parentheses above the wild type sequence.



Table 15: deleted. 

Table 16: Genetic Code Table
With Secondary-Structure
Preferences

                          Second Base
First  |     T     |     C     |     A     |     G     | Third
Base   |           |           |           |           | base
-------+-----------+-----------+-----------+-----------+------
       |  F  b/a   |  S  a/b   |  Y  b     |  C  b     |  T
       |  F  b/a   |  S  a/b   |  Y  b     |  C  b     |  C
  T    |  L  a/b   |  S  a/b   |  stop     |  stop     |  A
       |  L  a/b   |  S  a/b   |  stop     |  W  b/a   |  G
-------+-----------+-----------+-----------+-----------+------
       |  L  a/b   |  P  -     |  H  a/b   |  R  b/a   |  T
       |  L  a/b   |  P  -     |  H  a/b   |  R  b/a   |  C
  C    |  L  a/b   |  P  -     |  Q  b/a   |  R  b/a   |  A
       |  L  a/b   |  P  -     |  Q  b/a   |  R  b/a   |  G
-------+-----------+-----------+-----------+-----------+------
       |  I  b     |  T  b     |  N  a/b   |  S  a/b   |  T
       |  I  b     |  T  b     |  N  a/b   |  S  a/b   |  C
  A    |  I  b     |  T  b     |  K  a/b   |  R  b/a   |  A
       |  M  b     |  T  b     |  K  a/b   |  R  b/a   |  G
-------+-----------+-----------+-----------+-----------+------
       |  V  b     |  A  a     |  D  a/b   |  G  b/a   |  T
       |  V  b     |  A  a     |  D  a/b   |  G  b/a   |  C
  G    |  V  b     |  A  a     |  E  a     |  G  b/a   |  A
       |  V  b     |  A  a     |  E  a     |  G  b/a   |  G

Amino acids denoted "b" strongly favor extended structures.
Amino acids denoted "b/a" favor extended structures.
Amino acids denoted "a/b" strongly favor helical structures.
Amino acids denoted "a" very strongly favor helices.
Proline is denoted "-" and favors neither beta sheets nor
helices.

b:    I, M, V, T, Y, C
b/a:  F, Q, G, W
a/b:  L, S, H, N, K, D
a:    A, E

Table 17

Fraction of DNA molecules having
n non-parental bases when
reagents have fraction
M of parental nucleotide.

Number of bases using mixed reagents is 30.

M      .9965      .97716     .92612     .8577      .79433     .63096
f0     .9000      .5000      .1000      .0100      .0010      .000001
f1     .09499     .35061     .2393      .04977     .00777     .0000175
f2     .00485     .1188      .2768      .1197      .0292      .000149
f3     .00016     .0259      .2061      .1854      .0705      .000812
f4     .000004    .00409     .1110      .2077      .1232      .003207
f8     0.         2 x 10^-7  .00096     .0336      .1182      .080165
f16    0.         0.         0.         5 x 10^-7  .00006     .027281
f23    0.         0.         0.         0.         0.         .0000089
most   0          0          2          5          7          12

fn is the fraction of all synthetic DNA molecules having n
non-parental bases.

"most" is the value of n having the highest probability.
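
The fractions in Table 17 are consistent with a binomial model in
which each of the 30 mixed positions independently keeps the parental
base with probability M. A minimal sketch under that assumption (the
function name is illustrative) reproduces individual entries:

```python
from math import comb


def fraction_with_n_changes(n, m_parental, length=30):
    """Binomial probability that exactly n of `length` positions carry a
    non-parental base when each position is parental with probability
    m_parental."""
    p_change = 1.0 - m_parental
    return comb(length, n) * p_change ** n * m_parental ** (length - n)


for m in (0.9965, 0.97716, 0.92612):
    row = [fraction_with_n_changes(n, m) for n in range(5)]
    print(m, " ".join(f"{f:.5f}" for f in row))
```

For example, M = .97716 gives f0 close to .5000 and f1 close to
.35061, matching the second column of the table.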

Table 18: best vgCodon

Program "Find Optimum vgCodon."
INITIALIZE-MEMORY-OF-ABUNDANCES
DO ( t1 = 0.21 to 0.31 in steps of 0.01 )
. DO ( c1 = 0.13 to 0.23 in steps of 0.01 )
. . DO ( a1 = 0.23 to 0.33 in steps of 0.01 )
Comment calculate g1 from other concentrations
. . . g1 = 1.0 - t1 - c1 - a1
. . . IF( g1 .ge. 0.15 )
. . . . DO ( a2 = 0.37 to 0.50 in steps of 0.01 )
. . . . . DO ( c2 = 0.12 to 0.20 in steps of 0.01 )
Comment Force D+E = R+K
. . . . . . g2 = (g1*a2 - .5*a1*a2)/(c1+0.5*a1)
Comment Calc t2 from other concentrations.
. . . . . . t2 = 1. - a2 - c2 - g2
. . . . . . IF( g2.gt.0.1 .and. t2.gt.0.1 )
. . . . . . . CALCULATE-ABUNDANCES
. . . . . . . COMPARE-ABUNDANCES-TO-PREVIOUS-ONES
. . . . . . . end_IF_block
. . . . . . end_DO_loop ! c2
. . . . . end_DO_loop ! a2
. . . . end_IF_block ! if g1 big enough
. . . end_DO_loop ! a1
. . end_DO_loop ! c1
. end_DO_loop ! t1
WRITE the best distribution and the abundances.
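
The search outlined in Table 18 can be sketched in runnable form. Two
points are assumptions not stated in the table itself: the third codon
position is taken to be an equimolar T/G mixture (Table 20 reads only
T3i and G3i), and the quantity compared between trials is taken to be
the ratio of the least-favored to the most-favored amino-acid abundance
(the lfaa/mfaa ratio reported with Table 19). All function and variable
names are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons ordered T, C, A, G at each position.
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
THIRD = {"T": 0.5, "G": 0.5}  # assumption: equimolar T/G at position 3
# Only codons whose third base is present contribute: (b1, b2, weight, aa).
ACTIVE = [(b1, b2, THIRD[b3], AMINO[16 * i + 4 * j + k])
          for (i, b1), (j, b2), (k, b3) in product(enumerate(BASES), repeat=3)
          if b3 in THIRD]


def abundances(first, second):
    """Amino-acid (and stop, '*') abundances for given base fractions
    at codon positions 1 and 2."""
    out = {}
    for b1, b2, w3, aa in ACTIVE:
        out[aa] = out.get(aa, 0.0) + first[b1] * second[b2] * w3
    return out


def lfaa_mfaa_ratio(ab):
    """Least-favored / most-favored amino-acid abundance, stops excluded."""
    values = [ab.get(aa, 0.0) for aa in "ACDEFGHIKLMNPQRSTVWY"]
    return min(values) / max(values)


def find_optimum():
    """Grid search over the ranges of Table 18 (fractions in steps of 0.01)."""
    best_ratio, best = 0.0, None
    for t1p, c1p, a1p in product(range(21, 32), range(13, 24), range(23, 34)):
        t1, c1, a1 = t1p / 100, c1p / 100, a1p / 100
        g1 = 1.0 - t1 - c1 - a1
        if g1 < 0.15 - 1e-9:
            continue
        first = {"T": t1, "C": c1, "A": a1, "G": g1}
        for a2p, c2p in product(range(37, 51), range(12, 21)):
            a2, c2 = a2p / 100, c2p / 100
            # Force abundance(D) + abundance(E) = abundance(R) + abundance(K).
            g2 = (g1 * a2 - 0.5 * a1 * a2) / (c1 + 0.5 * a1)
            t2 = 1.0 - a2 - c2 - g2
            if g2 <= 0.1 or t2 <= 0.1:
                continue
            second = {"T": t2, "C": c2, "A": a2, "G": g2}
            ratio = lfaa_mfaa_ratio(abundances(first, second))
            if ratio > best_ratio:
                best_ratio, best = ratio, (first, second)
    return best_ratio, best


# A hand-picked distribution (our choice for illustration) evaluated directly.
example = abundances({"T": 0.26, "C": 0.18, "A": 0.26, "G": 0.30},
                     {"T": 0.22, "C": 0.16, "A": 0.40, "G": 0.22})
print({aa: round(100 * f, 2) for aa, f in sorted(example.items())})
```

Calling `find_optimum()` walks the full grid (about 1.7 x 10^5
combinations) and returns the best ratio with its base fractions. The
hand-picked distribution in the last lines gives W = 2.86%, S = 7.02%
and stop = 5.20%, the values that appear in Table 19.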

Table 19: Abundances obtained
from optimum vgCodon

Amino                    Amino
acid    Abundance        acid    Abundance
A       4.80%            C       2.86%
D       [illegible]      E       [illegible]
F       2.86%            G       6.60%
H       3.60%            I       2.86%
K       5.20%            L       6.82%
M       2.86%            N       5.20%
P       2.88%            Q       3.60%
R       6.82%            S       7.02% mfaa
T       4.16%            V       6.60%
W       2.86% lfaa       Y       5.20%
stop    5.20%

lfaa = least-favored amino acid
mfaa = most-favored amino acid
ratio = Abun(W)/Abun(S) = 0.4074

j    (1/ratio)^j    (ratio)^j       stop-free
1    2.454          .4074           .9480
2    6.025          .1660           .8987
3    14.788         .0676           .8520
4    36.298         .0275           .8077
5    89.095         .0112           .7657
6    218.7          4.57 x 10^-3    .7258
7    536.8          1.86 x 10^-3    .6881
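
The lower half of Table 19 can be reproduced with a few lines of Python. The sketch assumes that the "stop-free" column is (1 - stop abundance)^j, that is, the chance that j variegated codons contain no stop codon; this reading is an inference that matches the tabulated values to within rounding. The function name is hypothetical.

```python
def power_table(ratio, stop, max_j=7):
    """Rows of (j, (1/ratio)^j, ratio^j, stop-free) for j = 1..max_j."""
    rows = []
    for j in range(1, max_j + 1):
        rows.append((j, (1.0 / ratio) ** j, ratio ** j, (1.0 - stop) ** j))
    return rows


# Values taken from Table 19: ratio = 0.4074, stop abundance = 5.20%.
for j, inv, direct, stop_free in power_table(0.4074, 0.052):
    print(f"{j}  {inv:10.3f}  {direct:10.3e}  {stop_free:.4f}")
```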

Table 20: Calculate worst codon.
Program "Find worst vgCodon within Serr of given distribution."

INITIALIZE-MEMORY-OF-ABUNDANCES

READ Serr          Comment Serr is % error level.

Comment T1i,C1i,A1i,G1i, T2i,C2i,A2i,G2i, T3i,G3i
Comment are the intended nt-distribution.

READ T1i, C1i, A1i, G1i
READ T2i, C2i, A2i, G2i
READ T3i, G3i

Fdwn = 1.-Serr
Fup = 1.+Serr

DO ( t1 = T1i*Fdwn to T1i*Fup in 7 steps)
. DO ( c1 = C1i*Fdwn to C1i*Fup in 7 steps)
. . DO ( a1 = A1i*Fdwn to A1i*Fup in 7 steps)
. . . g1 = 1. - t1 - c1 - a1
. . . IF( (g1-G1i)/G1i .lt. -Serr)
Comment g1 too far below G1i, push it back
. . . . g1 = G1i*Fdwn
. . . . factor = (1.-g1)/(t1 + c1 + a1)
. . . . t1 = t1*factor
. . . . c1 = c1*factor
. . . . a1 = a1*factor
. . . . end_IF_block
. . . IF( (g1-G1i)/G1i .gt. Serr)
Comment g1 too far above G1i, push it back
. . . . g1 = G1i*Fup
. . . . factor = (1.-g1)/(t1 + c1 + a1)
. . . . t1 = t1*factor
. . . . c1 = c1*factor
. . . . a1 = a1*factor
. . . . end_IF_block
. . . DO ( a2 = A2i*Fdwn to A2i*Fup in 7 steps)
. . . . DO ( c2 = C2i*Fdwn to C2i*Fup in 7 steps)
. . . . . DO ( g2 = G2i*Fdwn to G2i*Fup in 7 steps)
Comment Calc t2 from other concentrations.
. . . . . . t2 = 1. - a2 - c2 - g2
. . . . . . IF( (t2-T2i)/T2i .lt. -Serr)
Comment t2 too far below T2i, push it back
. . . . . . . t2 = T2i*Fdwn
. . . . . . . factor = (1.-t2)/(a2 + c2 + g2)
. . . . . . . a2 = a2*factor
. . . . . . . c2 = c2*factor
. . . . . . . g2 = g2*factor
. . . . . . . end_IF_block
. . . . . . IF( (t2-T2i)/T2i .gt. Serr)
Comment t2 too far above T2i, push it back
. . . . . . . t2 = T2i*Fup
. . . . . . . factor = (1.-t2)/(a2 + c2 + g2)
. . . . . . . a2 = a2*factor
. . . . . . . c2 = c2*factor
. . . . . . . g2 = g2*factor
. . . . . . . end_IF_block
. . . . . . IF(g2.gt.0.0 .and. t2.gt.0.0)
. . . . . . . t3 = 0.5*(1.-Serr)
. . . . . . . g3 = 1. - t3
. . . . . . . CALCULATE-ABUNDANCES
. . . . . . . COMPARE-ABUNDANCES-TO-PREVIOUS-ONES
. . . . . . . t3 = 0.5
. . . . . . . g3 = 1. - t3
. . . . . . . CALCULATE-ABUNDANCES
. . . . . . . COMPARE-ABUNDANCES-TO-PREVIOUS-ONES
. . . . . . . t3 = 0.5*(1.+Serr)
. . . . . . . g3 = 1. - t3
. . . . . . . CALCULATE-ABUNDANCES
. . . . . . . COMPARE-ABUNDANCES-TO-PREVIOUS-ONES
. . . . . . . end_IF_block
. . . . . . end_DO_loop ! g2
. . . . . end_DO_loop ! c2
. . . . end_DO_loop ! a2
. . . end_DO_loop ! a1
. . end_DO_loop ! c1
. end_DO_loop ! t1

WRITE the WORST distribution and the abundances.
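
The "push it back" step that Table 20 applies to g1 and to t2 can be isolated as one small helper. The Python sketch below is illustrative only: the function name and argument layout are hypothetical, and the example numbers (an equimolar intended mix with 5% error) are made up for the demonstration rather than taken from the tables.

```python
def push_back(derived, intended, others, serr):
    """Clamp a base fraction computed by difference to within +/- serr
    (relative) of its intended value, then rescale the other three
    fractions so that all four again sum to 1."""
    rel_err = (derived - intended) / intended
    if rel_err < -serr:
        derived = intended * (1.0 - serr)   # too far below: use Fdwn
    elif rel_err > serr:
        derived = intended * (1.0 + serr)   # too far above: use Fup
    else:
        return derived, list(others)        # already inside the band
    factor = (1.0 - derived) / sum(others)
    return derived, [x * factor for x in others]


# Hypothetical case: three bases each 5% high leave the fourth 15% low.
g1, (t1, c1, a1) = push_back(0.2125, 0.25, [0.2625, 0.2625, 0.2625], 0.05)
print(g1, t1 + c1 + a1 + g1)
```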

Table 21: Abundances obtained
using optimum vgCodon assuming
5% errors

Amino                    Amino
acid    Abundance        acid    Abundance
A       4.59%            C       2.76%
D       5.45%            E       6.02%
F       2.49% lfaa       G       6.63%
H       3.59%            I       2.71%
K       5.73%            L       6.71%
M       3.00%            N       5.19%
P       3.02%            Q       3.97%
R       7.68% mfaa       S       7.01%
T       4.37%            V       6.00%
W       3.05%            Y       4.77%
stop    5.27%

ratio = Abun(F)/Abun(R) = 0.3248

j    (1/ratio)^j    (ratio)^j       stop-free
1    3.079          .3248           .9473
2    9.481          .1055           .8973
3    29.193         .03425          .8500
4    89.888         .01112          .8052
5    276.78         3.61 x 10^-3    .7627
6    852.22         1.17 x 10^-3    .7225
7    2624.1         3.81 x 10^-4    .6844



Table 22, deleted 

Tables for Example 1

Table 100: λ OR3 Downstream
of Pamp that promotes galT,K

5'      |GAT|CGT|TAA|CGG|GCC|CTT|CTA|AAT|ACA|TTC|AAA|-
Olig#4 3'    ca att gcc cgg gaa gat tta tgt|aag ttt|-
            | HpaI |  | ApaI |              | -35  |

|TAT|GTA|TCC|GCT|CAT|GAG|ACA|ATA|ACC|-
 ata cat agg cga gta ctc tgt tat tgg-
                       | -10  |

|CTT|ATC|ACC|GCA|AGG|GAT|ATC|TAG|AGT|C 3' = Olig#3
 gaa tag tgg cgt tcc cta tag atc t 5'
   |_______ OR3 _______|  | XbaI |

Table 101: λ OR3 Downstream
of Pneo that promotes tet



5' I |CCT|GC G|AAC|CGG|AAT|TGClCAGl- 
Olig#6 3* gge cac tta ace tta acq atc- 

I fftuj I I I 



I CTG I GGG I CGC | CCT | CTG | GTA | AGG | TTG | - 
gac ccc acq aaa aac cat tec aac- 



I GGA I TAT I CAC | CGC | AAG | GGA | TA 3' = 01ig#5 

qgq at a ata acq ttc ect att eg a 5' 
I \ 0^ 3 |HindIIl| 

Table 102: rav gene
using lacUV5 as promoter

Spe I - BstE II - (Bal I - PpuM I - Bgl II - BamH I - Ava I)
- Kpn I - (Trp terminator) - Sfi I

5'-ACTAGT CCAGG C TTTACA CTT TATGC TTCCG GCTCG TATAAT GTGT GG
AAT TGTGA GCGGA TAACA ATTTC ACAC ! lacUV5

A GGTAACG AGGAGGAAATAAA ! BstE II & Shine-Dalgarno seq.
  | BstE II |

 m   e   q   r   i   t   l   k   d   y   a   m   r
 1   2   3   4   5   6   7   8   9  10  11  12  13
ATG GAA CAA CGC ATA ACC CTA AAG GAC TAC GCG ATG CGC

 f   g   q   t   k   t   a   k   d   l
14  15  16  17  18  19  20  21  22  23
TTT GGC CAA ACC AAG ACA GCG AAG GAC CTA
 | Bal I |                  | PpuM I |

 g   v   y   q   s   a   i   n   k   a   i
24  25  26  27  28  29  30  31  32  33  34
GGG GTG TAT CAG AGC GCG ATT AAC AAG GCC ATC

 h   a   g   r   k   i   f   l   t   i   n   a   d
35  36  37  38  39  40  41  42  43  44  45  46  47
CAT GCC GGC CGA AAG ATC TTC CTA ACC ATT AAC GCT GAT
               | Bgl II |

 g   s   v   y   a   e   e   v   k   p   f   p   s
48  49  50  51  52  53  54  55  56  57  58  59  60
GGA TCC GTC TAC GCG GAA GAG GTA AAG CCC TTC CCG AGT

 n   k   k   t   t   a   .   .   .
61  62  63  64  65  66  67  67  68
AAC AAA AAA ACA ACA GCG TAA TAG TA GGTACC

agtcta agcccgc ctaatga gcgggct tttttttt ! terminator

GGCCcgactGGCC
|   Sfi I   |

-3' ! Sfi I

Table 103: Catalogue of plasmids

pEP1001    pAAaH with 4.3 kbp deletion of λ, Cla I
           site introduced.

pEP1002    pEP1001 with fd terminator and Spe I, Sfi
           I, Hpa I cloning site distal to galT,K.

pEP1003    pEP1002 with P^ promoter replaced by pBR322
           amp promoter (Pamp) and OR3 upstream of
           galT,K; Pamp and OR3 bounded by Hpa I and
           Xba I and containing Apa I cloning site
           between Hpa I and Pamp.

pEP1004    pKK175-6 with (Pamp, galT,K, fd terminator,
           Spe I, Sfi I cloning site) from pEP1003.

pEP1005    pEP1004 with Tn5 neo promoter (Pneo) and
           OR3 bounded by Stu I and Hind III.

pEP1006    pEP1005 with BamH I site removed by site-
           specific mutation.

pEP1007    pEP1006 with (lacUV5, S.D., rav cloning
           site, trpa terminator).

pEP1008    pEP1007 with N-terminal part of rav gene.

pEP1009    pEP1008 with complete rav gene.

pEP1010    pEP1009 with OR3 replaced by scrambled
           OR3 sequence.

pEP1011    pEP1009 with OR3 sequences replaced with
           the HIV 353-369 Left Symmetrized Target.

pEP1012    pEP1009 with OR3 sequences replaced with
           the HIV 353-369 Right Symmetrized Target.

Table 103 (continued): Catalogue of plasmids

pEP1100 to     pEP1011 with rav j^.
pEP1199

pEP1200 to     pEP1012 with ravR *
pEP1299

pEP1301        pEP1100 with ravL- VF55.

pEP1302        pEP1100 with ravL- FW58.

pEP1303        pSP64 with Tn5 neo

pEP1304        pEP1303 with deletion of Ap resistance
               gene.

pEP1305        pEP1304 with ravL- VF55.

pEP1306        pEP1304 with ravL- FW58.

pEP1307        pEP1304 with rav ,

pEP1400 to     pEP1200 series plasmids with HIV 353-369
pEP1499        substituted for Right Symmetrized Targets.

pEP1500 to     pEP1400 series plasmids containing
pEP1599        modified SSHLr genes producing Rav^ proteins
               that complement the ravL- VF55 mutation.

pEP1600 to     pEP1400 series plasmids containing
pEP1699        modified rav p genes producing Ravj^ proteins
               that complement the ravL- FW58 mutation.

pEP2000        pEP1009 with rav replaced by arc.

pEP2001        pEP2000 with arc operator in Pneo, tet.

pEP2002        pEP2001 with arc operator in Pamp, galT,K.

pEP2003        pEP2002 with Target#1 in Pneo, tet.

pEP2004        pEP2003 with Target#1 in Pamp, galT,K.

vg1-pEP2005    pEP2004 with vgDNA (variegation #1 of
               polypeptide).

vg2-pEP2006    pEP2004 with vgDNA (variegation #2 of
               polypeptide).

vg3-pEP2007    pEP2004 with vgDNA (variegation #3 of
               polypeptide).

pEP2010        pEP2002 with Target #2 in Pneo, tet.

pEP2011        pEP2010 with Target #2 in Pamp, galT,K.

vg1-pEP2012    pEP2011 with vgDNA (variegation #1 of
               residues 1-10).

pEP3000        pEP2004 with CI2-arc(1-10) in place of arc.

pEP4000        pEP2002 with Target #3 in Pneo, tet.

pEP4001        pEP4000 with Target#3 in Pamp, galT,K.

pEP4002        pEP4001 with cro"hl2 in place of arc.

vg1-pEP1233    pEP4002 with vgDNA (variegation #1 of
               polypeptide segment).

Table 104: fd terminator
and multiple cloning site
to insert after galT,K



5 ' I CGA I AAG I GCT | CCT | TT T | GCA | GCC | TTT | TTT j TTT | - 
0li9#2 » 3* t tte caa aaa aaa eat cao aaa aaa aaaj- 

I fd terminator : | 



I ACT I AGT I CAG | TGG | CCC | GAC | TGG | CCG | TTA | AC 3 ' = 01ig#l 
[tea tea I ate acc ggq eta acc ggc aat tqg c 5' 
I Soel I I Sf il I I Hpal I 

Table 105: Mutagenic Primer
to Remove BamH I site from pEP1005

     | t | p | v | l | w | i |
     | 93| 94| 95| 96| 97| 98|
5' CC|ACA|CCC|GTC|CTG|TGG|ATC|-
3' gg tgt ggg cag gac acc taT-

| l | y | a | g | r | i |
| 99|100|101|102|103|104|
|CTG|TAC|GCC|GGA|CGC|ATC|GT 3'   pEP1005
 Aac atg cgg cct gcg tag ca 5'   Olig#7



Bold, upper case bases indicate sites of mutation. 



Table 106: deleted. 

Table 107: Synthesis of lacUV5-BstEII-BglII-
KpnI-trpa terminator

5' I CTA I GTC I CAG I GCT { TTA | CKC | TTT | ATG | C TT | CCG } GCT I - 

Olig#9 = 3' ao ate ca a aat ota aaa tae aaa aac caa- 
I Spel I I • -35 I 

/3' = 01ig#8 ^ 

I CGT I ATA I ATG \ TGT { GGA I ATTIGTG | AGC | GGA | TA A | CAA I TTT I 
Ota tat tac aca cct taa cac tea cct att att aaa- 
I -10 I I lac o perator |. 

Olig #11 =3'/ 

/3« =01ig #10 

|CAC|ACA|GGT | AAC|GAGCAGAGA TCT a I TGC I GGT \ACC\- 

ota tat cca ttq otcctctct aaa t acq cca tgg- 

I BstEii I I pqj,j?;| ' I KpnT^ I 

Olig #13 =3'/ 

I AGT I CTA I AGC I CCG | CCT | AAT | GAG | CGG | GCT | TTT | TTT | TT - 
tca gat tea aac qqa tta etc acc caa aaa aaa aa - 
I spacer I troA terminator _|. 



G|GCC| CGA|C 3' = Olig #12 
c caa g 5* 

I sfli I 



Table 108: deleted. 

Table 109: Synthesis of
First segment of rav gene



5< C I AGG \ AGG I TAA ( CCA | qcra | aaa | aat | aaa I - 

iPgtBTX I 

I ATGj GAA I CAA| CGC \ ATA [ ACC | CTA | AAG | GAC | T AG | GCG | ATG | CGC | - 



/3« = 01ig#14 
I TTT I GGC I CAA i Acd | AAG { ACA I GCG I AAG I GAC I CTA I 
01ig#15 = 3' gg ttc tot coc ttc eta oat- 
I Ball I I PpuI^X I 

I GGG I GTG I TAT | GAG | AGC | GCG | ATT | AAC \ AAG | GCC | ATC | 
CGC cac ata ate tea cac taa tta ttc ca a tao- 



I CAT I GCC I G6C [ CGA | AAG | ATC | TTC | CTG | 
ata egg oca act ttc tag aaa aac 5' 

|BalII I 



WO 90/07862 PCT/US90/00024

Table 110: Second segment of rav gene

    | r | k | i | f | l | t | i | n | a | d |
    | 38| 39| 40| 41| 42| 43| 44| 45| 46| 47|
5' C|CGA|AAG|ATC|TTC|CTA|ACC|ATT|AAC|GCT|GAT|

| g | s | v | y | a | e | e | v | k | p | f | p | s |
| 48| 49| 50| 51| 52| 53| 54| 55| 56| 57| 58| 59| 60|
|GGA|TCC|GTC|TAC|GCG|GAA|GAG|GTA|AAG|CCC|TTC|CCG|AGT|
| BamHI |   | strand overlap |          | Ava I |

| n | k | k | t | t | a | . | . | . |
| 61| 62| 63| 64| 65| 66| 67| 67| 68|
|AAC|AAA|AAA|ACA|ACA|GCG|TAA|TAG|TAG|gta|cca|gtc|t 3'
                                  | KpnI |



a 

9 

Table 111: λ OR core sequences
Used to search HIV-1

                                 1234567
Kim et al. Consensus-A       5'  CCGCGGG  3'
                             3'  GGCGCCC  5'   Kim Consensus-S

Symmetric Consensus-A        5'  CCGCCGG  3'
                             3'  GGCGGCC  5'   Symm. Consensus-S

OR3A                         5'  CCGCAAG  3'
                             3'  GGCGTTC  5'   OR3S

OR3A/Symm. Consensus.6       5'  CCGCAGG  3'
                             3'  GGCGTCC  5'   OR3S/Symm. Cons. 2

OR3A/Symm. Consensus.5       5'  CCGCCAG  3'
                             3'  GGCGGTC  5'   OR3S/Symm. Cons. 3
                                 7654321

Table 112: Potential target binding sequences
having subsequences matching
six of seven bases

                         CCGCGGG        Kim consensus-A
HIV-1 subsequence = ACTTTCCGCtGGGGACT
353 ↑

                         CCGCAGG        OR3A/consensus.6
HIV-1 subsequence = TCTCGaCGCAGGACTCG
681 ↑

                         CTTGCGG        OR3S
HIV-1 subsequence = TTTGACTaGCGGAGGCT
760 ↑
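
The kind of scan that yields entries such as those in Table 112 can be sketched as a sliding-window comparison. The Python below is a minimal illustration, not the search procedure actually used: it examines one strand only, the function name is hypothetical, and mismatched bases are shown in lower case to mirror the table's convention.

```python
def find_near_matches(core, sequence, min_matches):
    """Return (start, matches, window) for every window of len(core) bases
    in `sequence` that agrees with `core` at min_matches or more positions.
    `start` is a 0-based offset; mismatched bases are lower-cased."""
    n = len(core)
    sequence = sequence.upper()
    hits = []
    for start in range(len(sequence) - n + 1):
        window = sequence[start:start + n]
        matches = sum(c == w for c, w in zip(core, window))
        if matches >= min_matches:
            shown = "".join(w if c == w else w.lower()
                            for c, w in zip(core, window))
            hits.append((start, matches, shown))
    return hits


# First entry of Table 112: Kim consensus-A against the listed 17-mer.
print(find_near_matches("CCGCGGG", "ACTTTCCGCTGGGGACT", 6))
```

Searching the opposite strand as well would amount to running the same function on the reverse complement of the sequence.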

Table 113: Potential target binding sequences
having subsequences matching five of seven bases

Symmetric consensus-S        CCGGCGG
HIV-1 subsequence            GACTTTCCGctGGGGAC
                             352 ↑

OR3S/consensus.2             CCTGCGG
HIV-1 subsequence            TTTCCgCTGgGGACTTT
                             355 ↑

OR3S/symm consensus.3        CTGGCGG
HIV-1 subsequence            TAGCAgTGGCGcCCGAA
                             630 ↑

Symmetric consensus-A        CCGCCGG
HIV-1 subsequence            CAGTGgCGCCcGAACAG
                             633 ↑

OR3A/symm consensus.5        CCGCCAG
HIV-1 subsequence            CAGTGgCGCCcGAACAG
                             633 ↑

OR3A/consensus.6             CCGCAGG
HIV-1 subsequence            GACTAgCGgAGGCTAGA
                             763 ↑

symm consensus-S             CCGGCGG
HIV-1 subsequence            GACTAgCGGaGGCTAGA
                             763 ↑

Table 113, continued: Potential target binding sequences
having subsequences matching five of seven bases

OR3A/symm consensus.5        CCGCCAG
HIV-1 subsequence            GAAGAtgGCCAGTAAAA
                             4545 ↑

OR3A/consensus.6             CCGCAGG
HIV-1 subsequence            ACAGAtgGCAGGTGATG
                             5047 ↑

OR3A/consensus.6             CCGCAGG
HIV-1 subsequence            TCCTAtgGCAGGAAGAA
                             5965 ↑




Table 114: Coding region of ravL-22 gene

 m   e   q   r   i   t   l   k   d   y   a   m   r
 1   2   3   4   5   6   7   8   9  10  11  12  13
ATG GAA CAA CGC ATA ACC CTA AAG GAC TAC GCG ATG CGC

 f   g   R   t   k   t   a   k   d   l
14  15  16  17  18  19  20  21  22  23
TTT GGC CGT ACC AAG ACA GCG AAG GAC CTA
                            | PpuM I |

 g   v   H   I   T   a   i   Q   N   a   i
24  25  26  27  28  29  30  31  32  33  34
GGG GTG CAT ATT ACG GCG ATT CAG AAT GCC ATC

 h   a   g   K   Q   i   f   l   t   i   n   a   d
35  36  37  38  39  40  41  42  43  44  45  46  47
CAT GCC GGC AAG CAG ATC TTC CTA ACC ATT AAC GCT GAT

 g   s   v   y   a   e
48  49  50  51  52  53
GGA TCC GTC TAC GCG GAA
| BamHI |

 e   v   k   p   f   p   s
54  55  56  57  58  59  60
GAG GTA AAG CCC TTC CCG AGT
            | Ava I |

 n   k   k   t   t   a   .   .   .
61  62  63  64  65  66  67  67  68
AAC AAA AAA ACA ACA GCG TAA TAG TA GGTACC
                                  | KpnI |


Table 115: ravR-38 gene

 m   e   q   r   i   t   l   k   d   y   a   m   r
 1   2   3   4   5   6   7   8   9  10  11  12  13
ATG GAA CAA CGC ATA ACC CTA AAG GAC TAC GCG ATG CGC

 f   g   E   t   k   t   a   k   d   l
14  15  16  17  18  19  20  21  22  23
TTT GGC GAG ACC AAG ACA GCG AAG GAC CTA
                            | PpuM I |

 g   v   R   T   L   a   i   R   D   a   i
24  25  26  27  28  29  30  31  32  33  34
GGG GTG CGT ACT CTT GCG ATT CGT GAT GCC ATC

 K   a   g   N   H   i   f   l   t   i   n   a   d
35  36  37  38  39  40  41  42  43  44  45  46  47
AAG GCC GGC AAT CAT ATC TTC CTA ACC ATT AAC GCT GAT

 g   s   v   y   a   e   e   v   k   p   f   p   s
48  49  50  51  52  53  54  55  56  57  58  59  60
GGA TCC GTC TAC GCG GAA GAG GTA AAG CCC TTC CCG AGT
| BamHI |                           | Ava I |

 n   k   k   t   t   a   .   .   .
61  62  63  64  65  66  67  67  68
AAC AAA AAA ACA ACA GCG TAA TAG TA GGTACC
                                  | KpnI |

Tables for Example 2

Table 200:
P22 arc operator

P22 arc      5' ATGATAGAAG|C|ACTCTACTAT 3'
Operator     3' TACTATCTTC|G|TGAGATGATA 5'

consensus    5' ATrrTAGArk|s|myTCTAyyAT 3'
of half-     3' TAyyATCTym|s|krAGATrrTA 5'
sites

P22 arc left half operator  = ATrrTAGArk
P22 arc right half operator = myTCTAyyAT
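
The half-site consensus in Table 200 follows from writing each half of the operator in the same orientation and replacing every position where the two disagree with a two-base ambiguity code. A minimal Python sketch, with hypothetical function names and lower-case ambiguity letters as used in the tables:

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Two-base ambiguity codes (r, y, k, m, s, w).
AMBIGUITY = {
    frozenset("AG"): "r", frozenset("CT"): "y", frozenset("GT"): "k",
    frozenset("AC"): "m", frozenset("CG"): "s", frozenset("AT"): "w",
}


def reverse_complement(seq):
    """Reverse complement of an unambiguous upper-case DNA string."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def consensus(a, b):
    """Position-by-position consensus of two equal-length sequences."""
    return "".join(x if x == y else AMBIGUITY[frozenset((x, y))]
                   for x, y in zip(a, b))


left, right = "ATGATAGAAG", "ACTCTACTAT"   # half-sites from Table 200
print(consensus(left, reverse_complement(right)))   # left-half orientation
print(consensus(reverse_complement(left), right))   # right-half orientation
```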

Table 201 P22 Arc gene

          | m | k | g | m | s | k |
          | 1 | 2 | 3 | 4 | 5 | 6 |
GG|TAA|CCT|ATG|AAG|GGT|ATG|TCT|AAA|-

| m | p | h | f | n | l | r | w | p | r |
| 7 | 8 | 9 | 10| 11| 12| 13| 14| 15| 16|
|ATG|CCT|CAC|TTT|AAC|CTC|AGG|TGG|CCC|CGG|G-
                  | Bsu36I |       | ym?

  | e | v | l | d | l | v | r | k | v | a |
  | 17| 18| 19| 20| 21| 22| 23| 24| 25| 26|
 ag|gtc|ctt|gat|ctt|gtt|cgc|aag|gtt|gct|-
| PpuM I |

| e | e | n | g | r | s | v | n | s | e |
| 27| 28| 29| 30| 31| 32| 33| 34| 35| 36|
|GAG|GAA|AAC|GGT|CGG|TCC|GTT|AAC|TCT|G|-
              | Rsr II |
                        | Hpa I |

  | i | y | n | r | v | m | e | s | f | k |
  | 37| 38| 39| 40| 41| 42| 43| 44| 45| 46|
AG|ATT|TAT|AAT|CGC|GTT|ATG|GAG|TCG|TTC|AAG|

| k | e | g | r | i | g | a | . | . | . |
| 47| 48| 49| 50| 51| 52| 53|   |   |   |
|AAA|GAG|GGT|CGT|ATC|GGC|GCA|TAA|TAG|TGA|

|GGT|ACC|

Amino acid sequence encoded is identical to wild type P22
Arc.

DNA sequence designed for optimal placement of restriction
sites.

Table 202 Synthesis of P22 Arc gene

5 ' -G I TAA I CCT | ATS | AAG | GGT | ATG | TCT | AAA | - 

3 ' - aa tac t-tc cca tac aaa ttt - 
|BstE III 

/=olig#400 

I ATG I CCT I CAC | TTT | AAC | CTC | AGG | TGG | CCC | CGG | - 

tac gga qta aaa tta aaa tec acc qgq acc 3'= 

I BSU36I [ olig#405 

I GAG I GTC I CTT | GAT | CTT | GTT | CGC | AAG | GTT j GCt [ - 
etc cag aaa eta gaa caa acq ttc caa caa - 

/=olig#401 

|GAG|GAA| AAC | GGT | CGG | TCC | G TT | AAC | TCT | GAG | - 
etc ctt ttg cca acc agg c aa tta aaa ctc - 

\=olig#406 

/=olig#402 

I ATC I TAT I AAT | CGC | GTT | ATG | (?Ag | TgG j TTg | AAG|- 
taa ata tta aca caa tac etc age aag ;ttc - 

\=olig#407 

I AAA I GAG I GGT | CGT | ATC | GGC | GCA | TAA | TAG | TGA | - 

ttt etc cca aca tag cca cat att ate act - 

I GGT I AC 3* = Olig#403 
c 5« = olig#408 
I Kon I I 

Number of bases in each oligonucleotide.

400 = 43    401 = 48    402 = 42
403 = 47    405 = 50    406 = 49
407 = 38    408 = 34

Table 203: HIV-1 Subsequences
that are similar to one half of
the Arc Operator

                                              Number of
                                              mismatches

                     1234567890|0987654321
arcO               = ATrrTAGArk
HIV-1 subsequence  = ATtATAtAATACAGTAGCAAC        2
1019 ↑

                     1234567890|0987654321
arcO               = ATrrTAGArk
HIV-1 subsequence  = ATAATAcAGTAGCAACCCTCT        1
1024 ↑

                     1234567890|0987654321
arcO               =            myTCTAyyAT
HIV-1 subsequence  = ACAGTAGCAACCCTCTATTgT        1
1040 ↑

                     1234567890|0987654321
arcO               = ATrrTAGArk
HIV-1 subsequence  = ATGATAGgGGGAATTGGAGGT        1
2387 ↑

                     1234567890|0987654321
arcO               = ATrrTAGArk
HIV-1 subsequence  = tTGAcAGAAGAAAAAATAAAA        2
2624 ↑
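
The mismatch counts in Table 203 can be checked by treating each ambiguity letter of the half-operator as a set of allowed bases. The Python sketch below is illustrative; the function name is hypothetical.

```python
# Allowed bases for each pattern symbol (upper case = fixed base,
# lower case = ambiguity code as defined in the notes to Table 204).
ALLOWED = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "r": "AG", "y": "CT", "k": "GT", "m": "AC", "s": "CG", "w": "AT",
    "n": "ACGT",
}


def mismatches(pattern, target):
    """Count positions where `target` has a base the pattern does not allow."""
    return sum(base.upper() not in ALLOWED[code]
               for code, base in zip(pattern, target))


# Left half-operator against the first ten bases of the entry at 1019.
print(mismatches("ATrrTAGArk", "ATtATAtAAT"))
```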

Table 204 Synthesis of Potential DBP-1
vg1 for pEP2004

M K G M S K 
1 2 3 4 5 6 
R t -gpgSTACGG I TAA \ C CT | ATG | AAG | GGT | ATG | TCT I AAA I - 
|BstE II| 

2 2 2 2 2 2 
M P Q F |I/M|Q/R|D/V|R/I|W/G|D/G| 
7 8 9 1G|11| 12 I 13 I 14 1 15 1 16 1 
I ATG I CCT I CAC | TTT | AT s | CrG | GwT | AkA | kGG ) GrT I - 



|-3« = Olig#420 
2 2 2 2 2 42 2 2 

|Q/L|R/T|F/Y|R/C|W/G| V | Q | I/H| T/I |R/Q| 
I 17 1 18 1 19] 20 1 21 1 22) 23 1 24 1 25 { 26 1 
I CwG I ASA I TWT I VGT | kGG | GTG | CAG | AT s | AyC | CrG | 
3' -cc cac g tc taS tRa aYc- 



2222222222 
I V/I I R/I I F/Y I D/V I T/I I R/Q | V/I | D/G | V/I | P/Q | 
I 27| 28| 29| 30] 3l| 32 1 33| 34 1 35| 36| 
I rTT I AkA I TwT I GwT I Aye I CrG I rTT I GrT I rTT I CmG I 
Yaa tNt aWa cWa tJRa qYo Yaa cYa Yaa aKc- 



I TAA I TAG I TGA I AAC I CTC I AGG I CGTGATCC 
att ate act tta aa a tee acaetaaa ^5«=olig#421 
I BSU36I I spacer I 

Table 204, continued: NOTES

s = equimolar C and G        r = equimolar A and G
w = equimolar A and T        k = equimolar T and G
y = equimolar T and C        m = equimolar A and C
n = equimolar A, C, G, and T

There are 2^24 = (approx.) 1.6 x 10^7 DNA and protein sequences.

Number of bases in each oligonucleotide.
420 = 86    421 = 73
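
The sequence count in the notes follows from multiplying the number of bases allowed at each position. The Python sketch below illustrates this for the six variegated codons of residues 11 to 16 in Table 204 (ATs CrG GwT AkA kGG GrT); the function names are hypothetical, and the 24 two-fold positions behind the 2^24 figure are a tally of the variegated codons in that table.

```python
from itertools import product
from math import prod

# Bases allowed by each symbol, per the notes above.
ALLOWED = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "s": "CG", "r": "AG", "w": "AT", "k": "GT", "y": "CT", "m": "AC",
    "n": "ACGT",
}


def count_variants(degenerate):
    """Number of distinct DNA sequences encoded by a degenerate string."""
    return prod(len(ALLOWED[c]) for c in degenerate)


def expand(degenerate):
    """List every DNA sequence encoded by a (short) degenerate string."""
    return ["".join(p) for p in product(*(ALLOWED[c] for c in degenerate))]


segment = "ATsCrGGwTAkAkGGGrT"        # codons 11-16 of Table 204
print(count_variants(segment))         # six two-fold positions
print(2 ** 24)                         # 24 two-fold positions in total
```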



